
IRMA: A Reliable Multicast Architecture for the Internet

Kang-Won Lee, Sungwon Ha, Vaduvur Bharghavan
University of Illinois at Urbana-Champaign
Email: kwlee, s-ha, bharghav
Abstract: IRMA is a reliable multicast architecture that guarantees reliable, sequenced, and loosely synchronized delivery of multicast streams. IRMA provides ACK-based reliability, hybrid local server-initiated and sender-initiated loss recovery, and support for end-to-end flow control and congestion control. Unlike most contemporary work, IRMA supports TCP as the reliable multicast transport protocol at the end host without modifications. IRMA has been instantiated in a laboratory testbed, and its performance has been measured in various scenarios. Preliminary performance results show that IRMA is efficient and adaptive to the dynamics of the network.

I. INTRODUCTION

Recent years have witnessed a tremendous increase in the use of the Internet for a large variety of applications including commerce, web access, software distribution, multimedia, and of course, data communication. Many of these applications require reliable data transmission from a sender to multiple receivers. While it is possible to establish multiple unicast TCP connections and accomplish such data transmissions between each sender-receiver pair, this approach has two distinct disadvantages: (a) it duplicates packet transmissions for connections that could share common links, and (b) it requires application-level synchronization between the sender and the receiver set. For these reasons, reliable multicast is gaining popularity as a highly desirable feature of the future Internet.

A number of reliable multicast protocols have been proposed recently [1],[2],[3],[4],[5],[6]. The basic idea of most of these protocols is to use the IP multicast infrastructure for routing, and add functionality at the end hosts, and possibly at the multicast routers, in order to support reliable multicast. These protocols fall into broad categories based on whether they support sequenced [1],[3] or un-sequenced delivery [2],[4]; fully reliable [1],[3] or mostly reliable delivery [2]; loose synchronization among receivers [3],[4],[6] or allowing some receivers to lag far behind others [1],[2]; and support for application-dependent reliability and sequencing semantics [4] or providing guaranteed reliability and sequencing semantics [1],[3]. Our goal is to provide an architecture that guarantees reliable, sequenced, and loosely synchronized delivery of multicast streams with support for flow control and congestion control.
Specifically, we argue for the Illinois Reliable Multicast Architecture (IRMA), in which the sender does not maintain per-receiver state, with ACK-based reliability, hybrid local server-initiated and sender-initiated error recovery, and end-to-end reliability semantics. Unlike most contemporary work, we show that this architecture can efficiently support TCP as the reliable multicast transport protocol at the end host without modifications. Thus, IRMA does not require end hosts to install new software in order to enable reliable multicast; essentially, the reliable multicast infrastructure makes the TCP/IP protocol stack at the end hosts believe that the communication is unicast, while in fact, multiple receivers can participate in the reliable reception of the data. IRMA achieves end host transparency by introducing additional functionality in a subset of the multicast routers in the multicast tree (in the extreme case, only the multicast routers nearest the sender and the receivers need to be IRMA-aware).

We have taken the above approach for three main reasons: (a) the principles of establishing reliable packet delivery, congestion control and flow control have already been extensively studied in the context of unicast TCP communication, and can be effectively extended to multicast communication, (b) for practical reasons, widespread deployment of a new protocol takes time, and requires showing demonstrable robustness and efficiency over a large set of complex scenarios in the Internet; we instead reuse the proven TCP/IP protocol for reliable multicast, and (c) most contemporary approaches treat reliable multicast as a specialized service which needs to be instantiated in participating end hosts; instead we treat reliable multicast as a ubiquitous service for a large number of hosts in the Internet merely by adding functionality to a relatively small subset of multicast routers.

The following are the key ideas presented in this paper:
- We propose a new architecture, called IRMA, in which a virtual network of multicast routers cooperatively provides reliable multicast services transparent to the end host TCP/IP protocol stack.
- We describe an implementation of IRMA and address key design-related and implementation-related issues.
- We present preliminary performance results in our testbed, which indicate that reliable multicast using IRMA is as fast as the unicast TCP connection from the sender to the slowest receiver in the multicast group at any time.

We believe that IRMA enables a simple, easily deployable, and efficient approach for providing reliable multicast in the Internet.
The rest of the paper is organized as follows: Section II outlines the issues in IP multicast and TCP that impact the design of IRMA. Section III discusses the different approaches to reliable multicast and presents our design choices. Section IV describes the IRMA reliable multicast architecture. Section V describes an implementation of IRMA. Section VI presents performance results of IRMA in a laboratory testbed. Section VII summarizes the paper.

II. BACKGROUND

IRMA relies on the IP multicast infrastructure in the network for routing and group membership notification, and supports the use of TCP as the multicast transport protocol at the end hosts. We now describe relevant parts of these two technologies.

A. IP Multicast

IP multicast is achieved by a virtual network of multicast routers (MBONE), which establish tunnels between themselves and collectively route multicast packets in the network. Every subnetwork that can send and/or receive multicast packets must have a multicast router instantiated in it. IP multicast requires two components: a multicast routing protocol (e.g., DVMRP, MOSPF, PIM, CBT, Hierarchical DVMRP) and a group membership protocol (i.e., IGMP) [7]. A multicast routing protocol generates a directed tree in the MBONE. IGMP enables each multicast router to determine on which of its interfaces it should forward an incoming multicast packet because there are group members downstream. As a result of IGMP, the multicast router only knows whether or not there are any receivers for a multicast group on an interface, but neither the number nor the identity of the receivers.

B. Properties of TCP

TCP provides a sequenced and reliable unicast communication abstraction with end-to-end flow control and congestion control. There are four key instruments in TCP: (a) connection establishment is achieved via a 3-way handshake, and must precede data transmission; likewise, connection termination involves 2-way handshakes and precludes further data transmission, (b) sequenced and reliable data delivery is achieved via cumulative acknowledgements (optionally, selective acknowledgement (SACK)) from the receiver, and a go-back-N retransmission (or selective retransmission) policy at the sender, (c) flow control is achieved via receiver window advertisements in the acknowledgement packets, which bound the number of unacknowledged bytes in transit from the sender to the receiver, and (d) congestion control is achieved by a combination of four mechanisms: round-trip time estimation, slow start, congestion avoidance, and fast retransmit/fast recovery [8].
TCP is inherently unicast in nature and cannot be used for reliable multicast communication without special network support for a variety of reasons:
- The 3-way handshake is inherently unicast, involving the exchange of initial sequence numbers (ISNs) and maximum segment size (MSS). Since data transmission cannot precede connection establishment or follow connection termination, receivers cannot dynamically join or leave a reliable multicast session. In addition, many implementations of TCP explicitly prevent the use of a multicast address in the sender and receiver address field, although the TCP RFC does not prohibit the use of a multicast address as a connection end point.
- An ACK with a large sequence number from a receiver on a fast connection may cause the sender to move its window forward even if some other receiver lags behind. Likewise, fast receivers may advertise large receiver windows while others may advertise small windows. Even if we handle this problem by maintaining per-receiver state at the sender, the ACK implosion problem remains.

Fundamentally, TCP expects connection establishment to precede data transfer, and a TCP sender expects to receive cumulative acknowledgements from a single receiver. These assumptions are violated when we allow multiple receivers. However,
we found that the flow control and congestion control mechanisms of TCP can be naturally extended to support multicast communication by performing the cumulative aggregation of ACKs over the multicast tree. Additionally, local recovery can be used to improve loss recovery along congested paths (subtrees) and handle random loss in order to ensure that slow links do not stall an entire session. Thus, there are strong motivations to support the TCP/IP protocol at the end host in order to achieve reliable multicast.

III. ISSUES IN SUPPORTING RELIABLE MULTICAST

In this section, we discuss three key issues in supporting reliable multicast: the semantics of reliable multicast, NAK-based versus ACK-based architectures, and loss recovery mechanisms.

A. Semantics of Reliable Multicast

Unlike unicast communication, there is no consensus in contemporary research on what exactly reliable multicast semantics should be. Should packets be delivered in sequence? Should a slow receiver stall the entire multicast group? Should the reliability semantics be end-to-end? Recognizing the varied characteristics of applications, we consider three classes of reliable multicast semantics.

1. Strict reliability with loose synchronization: Many applications (e.g., distributed interactive simulation (DIS), broadcasting stock quotes and real-time news) require strict reliability with loose synchronization. Since data reception is synchronized among receivers (e.g., within one congestion window), the reliability semantics can be end-to-end and the transmission rate of a multicast session is constrained to be as slow as the slowest receiver in the multicast group at any time. This end-to-end semantics is desirable over split-connection semantics because the former provides stronger guarantees, and better robustness since it allows for intermediate node failures without disrupting the reliability semantics of the multicast session.

2. Partial reliability with loose synchronization: Some applications send heterogeneous data traffic with partial reliability requirements and multiple priorities in a single multicast session. For instance, MPEG flows consist of reliable control packets, and unreliable I, P, and B frames with different priority levels. A typical video conference application may send control messages, text, and whiteboard data as reliable streams, and video and audio as unreliable streams. In this case, an ideal multicast protocol should ensure that reliable packets are guaranteed in-sequence delivery, and slow receivers do not stall the entire multicast session; rather, each receiver receives (in addition to reliable packets) as many high priority unreliable packets of the heterogeneous flows as its connection quality can sustain.

3. Strict/partial reliability with no synchronization: If the heterogeneity in a multicast group is high, then split-connection semantics with no synchronization among receivers is suitable. For example, consider multicasting a file to 100 receivers where 99 receivers are connected via 100 Mbps Ethernet and one receiver via a 28.8 Kbps modem. In this case, pacing the multicast session to the slow receiver's speed is clearly inefficient. Instead, even if we sacrifice the end-to-end semantics, it is desirable to serve fast receivers at their rate and allow slow receivers to catch up eventually. By caching the data stream at some intermediate network nodes, we can also allow late join receivers to catch up from the start of the multicast session as well as supporting split-connection semantics.

For IRMA, we have chosen the first reliability semantics in order to support applications with strong reliability requirements. At the same time, IRMA can support the two looser semantics by the following mechanisms:
- IRMA can accommodate split-connection semantics by converting repair servers (which are used for local loss recovery in IRMA) into data servers for slow receivers.
- In concert with a layered multicast mechanism such as RLM [5], IRMA can terminate persistently lagging receivers, thereby preventing receivers connected via slow links from joining low priority layers (which requires higher bandwidth).
- In order to support partial reliability with loose synchronization semantics, IRMA can be extended to support adaptive transport protocols for heterogeneous data streams such as HPF [9].

B. NAK-based versus ACK-based Reliable Multicast

Several contemporary approaches have argued for a NAK-based reliability scheme with NAK suppression because it avoids the feedback implosion problem. NAK-based schemes work well under two conditions: (a) the NAK channel is reliable, otherwise the NAK-based schemes will not work correctly; and (b) the delay between any two receivers is small compared to the delay between the sender and the receivers, otherwise the timer-based NAK will not work effectively. In the Internet environment, both of the above conditions may be violated. Besides, as pointed out in [10], NAK-based schemes require an infinite buffer at the sender in order to guarantee correct operation while ACK-based schemes can guarantee reliable transmission with bounded buffers. We believe that for a dynamic environment such as the Internet, a conservative ACK-based scheme is superior (with SACKs (or NAKs) as enhancements to the basic ACK mechanism).
The key drawback of this approach is that it requires support at multicast routers in the network for ACK aggregation to prevent the ACK implosion problem. Of course, both the ACK-based schemes and NAK-based schemes involve significant overhead in order to guarantee reliable delivery and sequencing. While the overheads of ACK-based schemes, such as ACK aggregation, have been well documented in related work, NAK-based schemes also involve overheads in terms of timer management, distance estimation, packet retransmission, application level buffering, sequencing, and synchronization, etc. [4]. We have adopted an ACK-based approach in IRMA since we concluded that it is better suited to the Internet environment.

C. Loss Recovery

Once a receiver detects a packet loss, how does loss recovery take place? There are three possible solutions depending on who is responsible for the retransmission: sender-initiated, local server-initiated, or receiver-initiated scheme. In sender-initiated recovery, the NAK/ACK goes all the way back to the sender, which then retransmits the lost packet. In local server-initiated recovery (henceforth called local recovery), some intermediate node in the multicast tree called a repair server caches data packets, and retransmits packets locally without waiting for
the sender's retransmission. In receiver-initiated recovery, any receiver that has received and cached a packet can issue a local retransmission upon hearing a loss notification for the packet. Analysis in [11] shows that local recovery is essential for the performance improvement of reliable multicast. In IRMA, since end hosts use the TCP/IP protocol unchanged, receiver-initiated recovery is not an option. Therefore we use local recovery for enhanced performance with sender-initiated recovery as the default option.

In IRMA, local recovery plays an important role in supporting the TCP congestion control at the sender. Consider a large multicast group with independent packet loss among receivers. It is possible for the sender to observe consecutive packet losses when in fact no receiver sees consecutive packet losses. For example, consider a case where receiver $R_1$ does not receive packet $k$ and receiver $R_2$ does not receive packet $k+1$. In this case an ideal behavior of the sender would be to fast retransmit packets $k$ and $k+1$ without shrinking the congestion window twice (which will happen in a naive reliable multicast protocol since the sender cannot distinguish two independent packet losses from two consecutive packet losses at the same receiver). The situation gets aggravated as the multicast group size grows, which raises a serious scalability issue. However, introducing local recovery alleviates this problem since receivers $R_1$ and $R_2$ are served by their repair servers. Consequently, local recovery effectively handles the problem of random loss.

IV. DESIGN OF THE IRMA ARCHITECTURE

In the previous section, we presented issues in supporting reliable multicast and our design choices. In this section, we first present a high level sketch of our reliable multicast architecture, then describe the details of its design, divided into the following issues: (a) connection management, (b) multicast tree management, (c) ACK aggregation and support for flow control, (d) multicast congestion control, and (e) local recovery.

We make three simplifying assumptions for the purposes of explanation of IRMA: (a) while IRMA can support multiple senders to the same multicast group, we only consider a single sender case, (b) while only a subset of nodes in the multicast tree needs to be IRMA-aware, we use the term multicast router or node to imply IRMA multicast router, and (c) while IRMA supports SACK aggregation, we do not present the algorithm for SACK in this paper.

Briefly, IRMA works as follows: The IP multicast routing algorithm generates the multicast tree for a reliable multicast session. Data packets from the sender are multicast to receivers via this tree. ACKs from the receivers are sent back to the sender via the same tree, but in the opposite direction.
ACKs from multiple receivers are aggregated such that the TCP semantics of cumulative acknowledgements is maintained: each node in the multicast tree forwards an ACK with the minimum of all the sequence numbers and an estimated advertised window from the ACKs it has received from downstream. We preserve the TCP semantics for sequencing, reliability, flow control, and congestion control.

Multicast transmission differs from unicast transmission in two key ways: (a) receivers may join or leave while a multicast session is in progress, and (b) different receivers may lose packets at different times due to local congestion. We support dynamic joins and leaves in order to handle the former, and we support local recovery in order to handle the latter. Figure 1 shows the overview of the architecture with a simple example.

[Figure 1: diagram of the sender, IRMA and plain multicast routers, and receivers, with the local address mappings kept at the leaf routers. Legend: physical link; multicast tunnel; downstream multicast tree, $M$; upstream multicast tree, $M^{u}$; end host, non-member; end host, member of $m$; IRMA multicast router; plain multicast router.]

Fig. 1. The Reliable Multicast Architecture. Host $S$ (with unicast address $s$) has established a reliable multicast connection with the hosts in a receiver set $R$ corresponding to the multicast group with multicast address $m$. Sender $S$ is in the subnet of $M_0$, the root of the multicast tree, and each receiver is in the subnet of some IRMA multicast router ($M_i$). The downstream multicast tree ($M$) traversed by the data packets is shown in the black arrow, and the upstream multicast tree ($M^{u}$) traversed by the ACK packets is shown in the white arrow.

A. Connection Management

There are two aspects to connection management: (a) connection establishment and termination, and (b) dynamic joins and leaves.

A.1 Connection Establishment and Termination

In IRMA, multicast data transfer is one-way, i.e., only sender-to-receivers communication is allowed. In Figure 1, the sender $S$ with unicast address $s$ initiates the connection request by sending a SYN packet with the multicast destination address $m$. The multicast router in the subnet of $S$, $M_0$, picks up this packet and forwards it to the downstream nodes in the multicast tree $M$. Every multicast router in $M$ receives and caches the SYN, and forwards it downstream. When a leaf multicast router $M_i$ receives the multicast SYN packet, it picks a locally unique class E address, $e$, from a cache of usable class E addresses granted to it, creates a dynamic association $\langle s, m, e \rangle$, and replaces the source address of the SYN packet, $s$, with the class E address $e$. This is required because $M_i$ must trap the SYN+ACK responses from the receivers in order to aggregate them and forward the aggregated SYN+ACK to the source (footnote 1). The modified SYN packet is locally multicast to all receivers in the subnet. Each receiver responds with a SYN+ACK packet with destination address $e$. This gets picked up by the local multicast router, which then replaces the destination address $e$ with $s$, and aggregates it back up $M^{u}$ to $M_0$. $M_0$ then unicasts the packet back to $S$. Note that $e$ needs to be locally unique in the subnet in order to distinguish different senders to the same multicast group, but different leaf multicast routers may choose different mappings (as shown in Figure 1).

Processing of SYN+ACK is a special case at the leaf multicast router because each receiver randomly generates an initial sequence number (ISN) for SYN+ACK, and expects to see this value acknowledged in the return ACK. Thus, the leaf multicast router caches the ISN from each receiver in its subnet. When the leaf multicast router receives the ACK of the 3-way handshake from the source, it encapsulates the ACK in a unicast packet and delivers it to each receiver in its subnet, replacing the sequence number in the ACK with the cached ISN corresponding to the receiver. SYN+ACK aggregation ensures that no receiver sees a piggybacked ACK sequence number in data packets greater than its ISN, thereby ensuring original TCP semantics.

Connection termination initiated by the sender is handled in the same way as the connection establishment by the sender. Connection termination by the receiver (dynamic leave) is discussed in the next subsection.

A.2 Dynamic Joins and Leaves

In order to perform ACK aggregation, a leaf multicast router needs to know the identity of all the receivers in its subnet. However, since the IGMP join message does not provide this information, we designate a special multicast address, which is currently unassigned [12], for reliable multicast join. When an end host wants to join a reliable multicast group $m$, it multicasts a request for join to the reliable multicast join address. The local multicast router picks up this packet, and discovers both the identity of the requester and the multicast group it wants to join. The leaf multicast router then initiates the joining process. If the multicast router is not already in the multicast tree for the group, it first joins the tree using the IP multicast routing algorithm.

The join point for the new receiver in the multicast tree initiates the 3-way handshake by sending out the SYN packet cached during the connection establishment. The sequence number advertised in the SYN packet is $\Delta$ greater than that of the last data packet that has been transmitted by the join point, where $\Delta$ is set to a large enough value to compensate for the delay of the 3-way handshake. Essentially, our goal is to ensure that a new join does not stall the ongoing session waiting for the 3-way handshake to complete. Thus, dynamic joins do not initiate a full end-to-end connection establishment, and the newly joining receiver does not get a copy of all the packets from the start of the connection but only from the point of joining the group. The SYN packet is delivered unicast to the newly joining receiver because multicasting a SYN during an ongoing session will cause current receivers who hear the SYN packet to reset their connections.

Handling dynamic leave (connection termination initiated by the receiver) is similar to dynamic join. Connection termination by the receiver should terminate only the receiver but not the entire session. When the local multicast router receives a FIN packet from a receiver, it responds with an encapsulated FIN+ACK to the receiver and removes the receiver from its local receiver set. If the local receiver set is now empty, then the multicast router prunes itself from the multicast tree and sends a FIN to its parent node. Network reset is handled similarly to connection termination.

Footnote 1: Even if we do not change the sender address to a class E address, the IRMA multicast router will be able to capture the SYN+ACK. However, if the SYN+ACK gets picked up by a router other than the IRMA multicast router, it will be forwarded to the sender, which will undermine the mechanism of ACK aggregation.
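As a rough illustration of the leaf-router bookkeeping described for connection establishment and dynamic leave, the Python sketch below keeps the association between a sender address, a multicast group, and a class E address, together with the per-receiver ISN cache. The class `LeafRouter`, its method names, the example addresses, and the reuse of the ISN cache as the local receiver set are all assumptions made for this example; they are not taken from the IRMA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LeafRouter:
    """Hypothetical per-subnet state of a leaf router (all names are illustrative)."""
    class_e_pool: list                              # class E addresses granted to this router
    mapping: dict = field(default_factory=dict)     # (s, m) -> e
    isn_cache: dict = field(default_factory=dict)   # (s, m) -> {receiver: ISN}

    def on_multicast_syn(self, s, m):
        """Bind a locally unique class E address e to (s, m); e replaces s as the SYN source."""
        if (s, m) not in self.mapping:
            self.mapping[(s, m)] = self.class_e_pool.pop(0)
            self.isn_cache[(s, m)] = {}
        return self.mapping[(s, m)]

    def on_syn_ack(self, receiver, e, isn):
        """Trap a SYN+ACK sent to e, cache the receiver's ISN, return the real destination s."""
        for (s, m), bound in self.mapping.items():
            if bound == e:
                self.isn_cache[(s, m)][receiver] = isn
                return s
        raise KeyError(f"no session bound to {e}")

    def on_handshake_ack(self, s, m):
        """Per-receiver ISN used to rewrite each unicast copy of the sender's handshake ACK.
        (Standard TCP acknowledges ISN + 1; that arithmetic is left out of this sketch.)"""
        return dict(self.isn_cache[(s, m)])

    def on_fin(self, s, m, receiver):
        """Drop a leaving receiver; True means the local set is empty and the router may prune."""
        self.isn_cache[(s, m)].pop(receiver, None)
        return not self.isn_cache[(s, m)]


# Example session: one sender, one group, two local receivers (addresses are made up).
router = LeafRouter(class_e_pool=["240.0.0.1", "240.0.0.2"])
e = router.on_multicast_syn("10.0.0.1", "224.2.2.2")
router.on_syn_ack("r1", e, 1000)
router.on_syn_ack("r2", e, 2000)
print(router.on_handshake_ack("10.0.0.1", "224.2.2.2"))
```

Keying the state by the (sender, group) pair reflects the requirement that the class E address be locally unique so that different senders to the same group can be told apart.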

B. Multicast Tree Management

In this section, we describe two key aspects of multicast tree management: (a) incorporating state and processing in the nodes of the multicast tree, and (b) generating the upstream multicast tree for ACK aggregation.

Each IRMA multicast router maintains the following state for each reliable multicast session (see Figure 2, lines 1-15): (a) the node upstream, (b) the nodes downstream, (c) the cached SYN packet, (d) for each downstream node, the sequence number and the advertised window of the last ACK packet, the number of duplicate ACKs with the last sequence number, and the maximum cached sequence number for the subtree for local recovery (Section IV-E), and (e) the minimum of all sequence numbers acknowledged and the estimated minimum of all advertised windows by nodes downstream. In addition, leaf multicast routers maintain a mapping between the source address and the class E address, $\langle s, m, e \rangle$, and the progress of each receiver: when it last sent an ACK, and how many packet retransmits have been issued for the next packet in sequence for the receiver (used for network-initiated connection termination (Section IV-C)). To support local recovery, a repair server may cache all data packets with a sequence number that is greater than the minimum ACK sequence number for all its children. It is important to note that the size of local state in the multicast router is independent of the size of the multicast group.

ACKs need to traverse exactly the same tree as the multicast tree but in the opposite direction because they need to be aggregated. When a multicast router receives a SYN packet with source address $s$ and destination address $m$, it generates a special local multicast routing table entry for the source address in the packet. Essentially, the routing table entry consists of $\langle s, I \rangle$, where $s$ is the sender's address and $I$ is the incoming interface for $s$.
This routing table entry is used to determine the outgoing interface for any ACK packet with a multicast source address and a unicast destination address only.

C. ACK Aggregation and Support for Flow Control

In order to eliminate the ACK implosion problem and preserve the TCP semantics of cumulative acknowledgement and receiver window advertisement, we extend the same concepts to multicast connections with the semantics of minimum ACK sequence number and minimum receiver window for a sub-tree. Thus, each node in the multicast tree uses the minimum of all the received sequence numbers from its nodes downstream, and the estimated minimum of all the receiver windows from its nodes downstream, when it sends an ACK to its upstream node (Figure 2, lines 39, 40, and 47). Since the receiver with the smallest sequence number (the receiver on the slowest link) and the receiver with the smallest available buffer (the slowest receiver) may be different, the calculation of the estimated minimum window is separate from that of the minimum sequence number (Figure 2, lines 10-13). This preserves the end-to-end reliability semantics as well as the end-to-end flow control semantics.

In addition to the ACK implosion problem, a reliable multicast architecture should handle the receiver exposure problem, in which retransmitted data packets may be repeatedly sent over

 1  state of multicast router i
 2    node immediately upstream, parent(i)
 3    set of nodes immediately downstream, children(i)
 4    cached SYN
 5    for each downstream node c ∈ children(i)
 6      ack_seq(c)      // the last ACK sequence number
 7      adv_wnd(c)      // the window advertisement
 8      num_dup(c)      // the number of duplicate ACKs
 9      cached_seq(c)   // the max cached sequence number for subtree
10    ack_seq(i) ← min over c ∈ children(i) of ack_seq(c)
11    est_wnd(c) ← ack_seq(c) + adv_wnd(c)
12    min_est_wnd(i) ← min over c ∈ children(i) of est_wnd(c)
13    adv_wnd(i) ← min_est_wnd(i) − ack_seq(i)
14    if leaf router, mapping between sender and class E address: ⟨s, m, e⟩
15    if repair server, cached packets from ack_seq(i) to cached_seq(i)

16  reception of ACK (src, dst, ack_seq, adv_wnd, num_dup, cached_seq)
17  at interior multicast router i
18    // src ∈ children(i), dst = s
19    ack_seq(src) ← ack_seq
20    num_dup(src) ← num_dup
21    adv_wnd(src) ← adv_wnd
22    cached_seq(src) ← cached_seq
23    // if i is a repair server and has packet cached
24    // and none of i's children along the path have the packet cached
25    // and retransmit count is reached
26    if ((i is repair server) and (cached_seq(src) < ack_seq(src))
27        and (cached_seq(i) ≥ ack_seq(src))
28        and (num_dup = RETRY_COUNT))
29      // do local recovery
30      send_data_packet(i, src, data(ack_seq(src)))

31  reception of ACK (src, dst, ack_seq, adv_wnd)
32  at leaf multicast router i
33    // src ∈ children(i), dst = class E address corresponding to sender
34    ack_seq(src) ← ack_seq
35    num_dup(src)++
36    adv_wnd(src) ← adv_wnd
37    cached_seq(src) ← 0

38  sending ACK at node i
39    ack_seq ← ack_seq(i)
40    adv_wnd ← adv_wnd(i)
41    // pick max num_dup among all children with min ack_seq
42    num_dup ← max (num_dup among children with min ack_seq)
43    // pick min cached_seq among all children with min ack_seq
44    cached_seq ← min (cached_seq among children with min ack_seq)
45    if (i is a repair server)
46      cached_seq ← max (cached_seq, cached_seq(i))
47    send (i, parent(i), ack_seq, adv_wnd, num_dup, cached_seq)

Fig. 2. Pseudo code of multicast router state, receive, and send functions

sub-trees that have already received and successfully ACKed the packet. This is handled by not forwarding a data packet with sequence number $n$ to any destination node whose ACK sequence number exceeds $n$.

One side effect of ACK aggregation is that a slow, partitioned, or stalled receiver can disrupt the entire multicast session. While IRMA is constrained to be as slow as the slowest receiver in order to provide loose synchronization, it should not be susceptible to network partitions or misbehaving receivers. In IRMA, the leaf multicast routers have the ability to detect and terminate stalled receivers. Essentially, our goal is to provide loose synchronization among receivers, but we do give the network the ability to terminate connections for misbehaving, partitioned, or very slow end hosts and thereby provide acceptable service to the other receivers in the network.
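The aggregation rule of this section can be sketched in a few lines of Python. The sketch assumes that the estimated window of a child is the right edge of its window (`ack_seq + adv_wnd`), that the minimum of these edges is re-expressed relative to the aggregated `ack_seq`, and that a data packet is forwarded to a child only when the child's ACK sequence number does not exceed the packet's sequence number. This is one reading of Figure 2, lines 10-13, and of the receiver exposure rule; the names `ChildState`, `aggregate`, and `should_forward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildState:
    ack_seq: int   # last cumulative ACK sequence number reported by this child
    adv_wnd: int   # last advertised window reported by this child


def aggregate(children):
    """Return the (ack_seq, adv_wnd) pair a node would report upstream.

    ack_seq is the minimum over all children; adv_wnd is derived from the
    smallest window right edge (assumed reading of Figure 2, lines 10-13).
    """
    ack_seq = min(c.ack_seq for c in children.values())
    min_est_wnd = min(c.ack_seq + c.adv_wnd for c in children.values())
    return ack_seq, min_est_wnd - ack_seq


def should_forward(data_seq, child):
    """Receiver exposure check: skip children that have already ACKed past data_seq."""
    return child.ack_seq <= data_seq


# The child on the slowest link ("a") and the child with the least buffer ("b") differ.
children = {"a": ChildState(ack_seq=1000, adv_wnd=4000),
            "b": ChildState(ack_seq=3000, adv_wnd=1000)}
print(aggregate(children))
print(should_forward(2000, children["a"]), should_forward(2000, children["b"]))
```

In the example, child "a" determines the aggregated sequence number while child "b" determines the window edge, which is why the two minima are computed separately.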

D. Support for Congestion Control

IRMA assumes that TCP congestion control at the sender can also handle congestion in multicast sessions effectively. The key mechanism we add is functionality in the multicast routers to support fast retransmit. In TCP, when the sender receives three duplicate acknowledgements, it concludes that a packet was lost and retransmits that packet without waiting for a timeout. In a reliable multicast, duplicate ACKs must be handled carefully, because we want to preserve the fast retransmit semantics without triggering false fast retransmits. We support fast retransmission through two complementary mechanisms: (a) duplicate ACK counting, described in this section, and (b) local recovery, described in the next section.

The duplicate ACK counting mechanism works as follows. For each child in the multicast tree, a node maintains two variables, <sequence, count>, where sequence is the last sequence number of a cumulative ACK received from the child, and count is the number of duplicate ACKs received from the child with this sequence number. When a multicast router forwards an ACK packet to its parent, it passes a <sequence, count> pair. Thus, a multicast router forwards duplicate ACKs only when they are sent by the current local bottleneck receiver²; otherwise it accumulates them (Figure 2, lines 20, 35, 42). Note also that sending duplicate ACKs within the multicast tree is implicit, because we send the <sequence, count> pair rather than count separate ACKs for the sequence number. The exception is the root multicast router, which actually sends count duplicate ACKs to the sender.

² The local bottleneck receiver at a given time is the one whose last ACK sequence number is the same as the minimum ACK sequence number of the multicast router.

E. Local Recovery

Although IRMA is designed to work correctly without local recovery, packet caching and fast local recovery from a repair server near the receivers enhance the performance of the protocol while preserving the end-to-end reliability semantics, as pointed out in Section III-C. As in TCP fast retransmit, if a repair server receives an ACK notification with three duplicate ACKs in the count field, it initiates a fast retransmit (Figure 2, lines 23-30). However, it also propagates the duplicate ACK count unchanged upstream, because local recovery should not interfere with the fast retransmit/fast recovery mechanism of the sender's TCP.

With this approach alone, if there are two repair servers along a path in the multicast tree, the higher-level repair server will always retransmit packets to the lower-level repair server, even though the latter may already have the packets cached. We solve this problem by augmenting the <sequence, count> pair with a third field, the cached sequence number, which indicates the highest sequence number of any packet that has been cached by a repair server so far along the path of an ACK. Only repair servers use this field, to determine whether they need to initiate a fast retransmit on receiving three duplicate ACKs. Effectively, our algorithm guarantees that only the lowest-level repair server along the path of the duplicate ACKs that has the required packet cached will initiate a fast retransmit (Figure 2, lines 23-30, 45, 46).

V. IMPLEMENTATION

We have instantiated the IRMA architecture through a combination of kernel-level modifications and user-level processes. While our current implementation is specific to the Linux 2.0.x kernel, we believe it is portable to any Unix platform. The implementation was done in two parts: (a) end host modification, and (b) IRMA multicast routers.

At the end hosts, we made a one-line modification to the Linux TCP code so that it can send and receive TCP packets with multicast source/destination addresses, which the stock Linux implementation does not allow. Apart from this, no change to TCP is required. At the application level, we provide a library function that the receiver application calls to perform the reliable multicast join (Section IV-A.2).

The implementation of the IRMA multicast router is detailed below. An IRMA multicast router must be able to capture SYN, FIN, and RST packets to learn the connection states of ongoing multicast sessions, and ACK packets from downstream nodes to perform ACK aggregation. A user-level process called the state manager takes care of these responsibilities. Figure 3 illustrates the structure of an IRMA-aware multicast router.

[Figure 3: block diagram. User level: state manager, raw socket, and I/F handlers. Kernel level: BPF on incoming devices 1 and 2, TCP/UDP, IP, IP firewall (deny), and the outgoing device. The example tree shows the multicast router, network1, and network3, with ACK 20 and ACK 23 labels.]

Fig. 3. Structure of the IRMA multicast router. As an example, we show three interfaces and the corresponding multicast tree. Incoming ACKs for the reliable multicast session are captured by BPF, forwarded to the interface handler, and denied service by the firewall. Outgoing ACKs generated by the state manager are sent to the outgoing interface via a raw socket.
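The duplicate ACK counting and the cached sequence number check described above, which the router implementation has to carry out, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Figure 2: the names `Router`, `ChildState`, `on_ack`, and `repair_decision` are hypothetical, an ACK is assumed to carry the next expected sequence number (so the missing packet is the one numbered `sequence`), and a repair server is assumed to hold every packet up to its highest cached sequence number.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

DUP_ACK_THRESHOLD = 3  # same trigger as TCP fast retransmit


@dataclass
class ChildState:
    sequence: int = 0  # last cumulative ACK sequence number from this child
    count: int = 0     # duplicate ACKs reported for that sequence number


@dataclass
class Router:
    # Hypothetical per-session state: one <sequence, count> pair per child.
    # At least one child must be registered before aggregate() is called.
    children: Dict[str, ChildState] = field(default_factory=dict)

    def aggregate(self) -> Tuple[int, int]:
        """Pair of the local bottleneck: minimum sequence and its duplicate count."""
        seq = min(c.sequence for c in self.children.values())
        cnt = max(c.count for c in self.children.values() if c.sequence == seq)
        return seq, cnt

    def on_ack(self, child: str, sequence: int,
               count: int) -> Optional[Tuple[int, int]]:
        """Record a child's pair; return the pair to send upstream, or None.

        Duplicates from a child that is not the bottleneck leave the
        aggregate unchanged, so they are accumulated and nothing is sent.
        """
        before = self.aggregate()
        self.children[child] = ChildState(sequence, count)
        after = self.aggregate()
        return after if after != before else None


def repair_decision(sequence: int, count: int, cached_on_path: int,
                    my_highest_cached: int) -> Tuple[bool, int]:
    """Decide whether this repair server performs the local fast retransmit.

    Returns (retransmit, cached sequence number to forward upstream).
    The server retransmits only if it holds the missing packet and no
    repair server lower on the ACK path has already cached it.
    """
    retransmit = (count >= DUP_ACK_THRESHOLD
                  and my_highest_cached >= sequence
                  and cached_on_path < sequence)
    return retransmit, max(cached_on_path, my_highest_cached)
```

In this sketch the duplicate count itself is passed upstream untouched, matching the requirement that local recovery must not hide duplicate ACKs from the sender's TCP. A real router would also need to avoid retransmitting the same packet again for every further duplicate.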

State Manager: The state manager maintains the connection state and handles ACK aggregation. Its main functions are: (a) to discover multicast routing information, the local virtual interfaces (VIFs), and the nodes immediately downstream; (b) to instantiate a reliable multicast routing table entry for the upstream multicast tree; (c) to fork interface handlers, which capture incoming SYN, FIN, RST, and ACK packets belonging to a multicast session on each physical interface; (d) to keep track of the TCP connection states of ongoing reliable multicast sessions; (e) to perform ACK aggregation; (f) to handle dynamic group membership management; and, optionally, (g) to perform local fast retransmit.

ACK aggregation: ACK aggregation involves three steps: (a) capturing copies of ACKs using BPF [13], (b) preventing the original ACKs from being forwarded using the IP firewall [14], and (c) sending the aggregate ACK upstream, bypassing the TCP protocol stack, using raw sockets.

Capturing incoming SYN, FIN, RST, and ACK packets belonging to a reliable multicast session is done using the BSD Packet Filter (BPF) and the packet capture library (libpcap) that is built upon it. Most BSD-derived kernels support BPF [8], which allows portability across platforms. Since BPF delivers a copy of the filtered packet rather than the packet itself, we use the IP firewall to prevent ACKs from being forwarded by default. An IP firewall rule contains the packet type to be filtered and one of three actions for matching packets: accept (let the packet pass the firewall), reject (do not accept, and send an ICMP message back to the sender), or deny (ignore the packet without sending an ICMP message). We deny incoming ACKs, RSTs, and FINs belonging to a reliable multicast session.

Aggregated ACKs, and the SYNs and FINs generated by dynamic connection management, need to be sent out directly through IP without going through the TCP layer. In addition, SYNs and FINs sent as a result of late joins and early leaves need to be delivered to the target host by unicast rather than being multicast to the entire subnet (Section IV-A.2). This can be accomplished using a raw socket, with which the user process can compose its own packet headers.

VI. PERFORMANCE RESULTS

We have instantiated the IRMA architecture in a laboratory testbed consisting of Ethernet and WaveLAN subnetworks. The hosts in our testbed are P6-200 Pentiums and P-120 laptops, all running Linux 2.0.31. While intensive performance testing in our testbed is underway, we report preliminary performance results as an illustration in this version of the paper.

To contrast the performance of IRMA in different scenarios, we considered a simple star configuration, as shown in Figure 4. There are two receivers: a fast receiver with address [...] and a slow receiver with address [...]. The multicast router is connected to the sender and the fast receiver by point-to-point 10 Mbps Ethernet links. The multicast router is connected to the slow receiver by a 2 Mbps WaveLAN wireless link.

[Figure 4: star topology. Sender (Gateway2000 P5-166), multicast router (Gateway2000 G6-200), fast receiver (Gateway2000 G6-200), and slow receiver (TI Travelmate P120), with links labeled link a and link b.]

Fig. 4. Testbed configuration

TABLE I
EFFECTIVE DATA RATE OF TCP UNICAST AND IRMA

                                               TCP unicast          IRMA
Test                                      fast rate  slow rate   rate     m/s
                                          (Mbps)     (Mbps)      (Mbps)   (%)
test 1    no congestion             avg   8.596      1.154       1.153    99.9
                                    max   8.623      1.154       1.154    100.0
                                    min   8.533      1.152       1.152    100.0
test 2.a  congestion on fast link   avg   7.314      1.154       1.149    99.6
                                    max   7.314      1.154       1.152    99.9
                                    min   7.314      1.152       1.145    99.4
test 2.b  congestion on slow link   avg   8.596      0.960       0.959    99.9
                                    max   8.623      0.970       0.961    99.1
                                    min   8.533      0.957       0.957    100.0
test 3.a  congestion on both links  avg   0.957      0.960       0.950    99.3
                                    max   0.961      0.970       0.959    99.8
                                    min   0.952      0.957       0.946    99.3
test 3.b  congestion on both links  avg   0.957      0.960       0.863    90.1
                                    max   0.961      0.970       0.875    91.0
                                    min   0.952      0.957       0.829    87.0
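As a rough check on how the figures in Table I relate to each other, the arithmetic can be sketched as follows. The function names are hypothetical, and the reading of the percentage column as the IRMA rate divided by the slower of the two TCP unicast rates is an inference from the published values, not a formula given in the paper.

```python
def effective_rate_mbps(bytes_sent: int, seconds: float) -> float:
    """Effective data rate in Mbps from the amount of data and the transfer time."""
    return bytes_sent * 8 / seconds / 1e6


def relative_throughput_pct(irma_mbps: float, tcp_fast_mbps: float,
                            tcp_slow_mbps: float) -> float:
    """IRMA rate as a percentage of the slower TCP unicast rate (assumed definition)."""
    return 100.0 * irma_mbps / min(tcp_fast_mbps, tcp_slow_mbps)
```

Applied to the rounded averages of test 1 (1.153, 8.596, and 1.154 Mbps), this gives approximately 99.9%. Results computed from the rounded table entries can differ from the published percentages in the last digit.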
The multicast address to which the receivers belong is [...]. In order to compare the performance of reliable multicast and unicast TCP for the case of heterogeneous receivers, we performed three tests: (a) no congestion, (b) congestion on either of the links, and (c) congestion on both links. We also compared the performance of reliable multicast and IP multicast in the last test. Table 1 summarizes the results for the first suite of tests, giving the average, maximum, and minimum for each test. Throughput was calculated from the transmission time and the amount of data sent. The rightmost column, m/s, indicates the relative throughput of reliable multicast using IRMA with respect to TCP unicast to the slower receiver for the test. Each test was repeated 10 times.

Test 1: In this test, the sender transmits 1 MB of data at peak speed. The transmission duration was calculated as the time difference between the arrival of the first SYN packet and the arrival of the last FIN+ACK packet. The results for Test 1 in Table 1 summarize the effective data rate of each protocol. They confirm the intuitive expectation: the overall time for multicasting the data file reliably is approximately the same as the overall time for unicasting the data file to the slow receiver.

Test 2: In the first part of this test, the sender transmits 1 MB of data at peak speed. We generate congestion by sending 200 KB of data at full speed on the fast link (from the multicast router to the discard port of the fast receiver). As expected, the difference in capacity between the fast path and the slow path (an order of magnitude in bandwidth) is so large that congestion on the fast link has no impact on the overall time of the reliable multicast transmission. The results for Test 2.a in Table 1 summarize the transmission times.

In the second part of this test, the sender transmits 1 MB of data at peak speed, and we generate the same congestion on the slow link. We show the sequence number versus time plot for both reliable multicast and unicast TCP in Figure 5. As expected, the two plots are indistinguishable, though both protocols scale back their transmission rates during congestion. Of course, the scale of this test is too small to conclude that congestion control in reliable multicast preserves the congestion control in TCP; however, this test reiterates the fact that when there is a large heterogeneity in the receiver set, reliable multicast is as fast as the slowest unicast connection. The results for Test 2.b in Table 1 summarize the effective data rates.

[Figure 5: plot of sequence number versus time (sec), with curves for IRMA to the multicast group (reliable multicast) and TCP to the slow receiver.]

Fig. 5. Performance of TCP unicast to the slow receiver and reliable multicast to both receivers where the slow link gets congested in the middle of data transmission.

Test 3: For this test, we throttled the fast link down to approximately 1 Mbps by introducing a 10 msec delay between successive packet transmissions. Again, the sender transmits 1 MB of data at peak speed. In the first part of this test, we generate 200 KB of congestion data at peak rate on both links at the same time. The sequence number versus time plot is essentially the same as Figure 5: reliable multicast is as fast as the slowest unicast connection. In the second part of this test, we space out the congestion on the two links. Specifically, congestion on the fast link is induced after the congestion on the slow link has subsided. As shown in Figure 6, reliable multicast performs marginally slower than the slowest unicast connection overall, but at any instant the slope of the reliable multicast connection is approximately the same as the slope of the slowest unicast connection. The results of Tests 3.a and 3.b in Table 1 summarize the effective data rates.

[Figure 6: plot of sequence number versus time (sec), with curves for IRMA to the multicast group (reliable multicast) and TCP to the slow receiver.]

Fig. 6. Performance of TCP unicast to the slow receiver and reliable multicast to both receivers where first the slow link, and then the fast link, gets congested. In the reliable multicast case, we see the effect of the second congestion (on the fast link).

Test 4: In this test, we compared the number of bytes lost in IP multicast with a slow link and congestion. While the comparison of unreliable IP multicast and reliable multicast based on TCP is unfair, it does indicate the amount of overhead and slowdown that user-level reliable multicast algorithms must incur to achieve reliable delivery of data to multiple receivers. We transmitted 1 MB of data at peak speed using multicasting with UDP and multicasting with TCP, respectively. In the UDP case, while the fast receiver did not lose any packets, the slow receiver received only 21.4% of the data on average. The key point we make here is that IP multicast is a pure datagram service, and a user-level reliable multicast transport protocol that operates on top of IP multicast will need to perform flow control and significant end-to-end recovery when a fast sender can swamp a slow receiver. Table 2 summarizes the effective data rates for this test. As before, the data rate was calculated from the transmission time and the amount of data delivered successfully to the receiver. The performance numbers for IRMA are identical to Test 1.

TABLE II

                      IP multicast                           IRMA
          fast receiver          slow receiver
       received   rate        received   rate        received   rate
       (packets)  (Mbps)      (packets)  (Mbps)      (packets)  (Mbps)
avg    1000       9.32        213.8      1.07        1000       1.153
max    1000       9.42        216        1.08        1000       1.154
min    1000       9.31        195        0.97        1000       1.152

VII. SUMMARY

In this paper, we have proposed a reliable multicast architecture called IRMA that guarantees reliable, sequenced, and loosely synchronized delivery of multicast streams. IRMA provides ACK-based reliability, hybrid local server-initiated and sender-initiated loss recovery, and support for end-to-end flow control and congestion control. Unlike most contemporary work, IRMA effectively supports TCP as the reliable multicast transport protocol at the end host without any modifications. We have presented an implementation of IRMA and preliminary performance results to show that IRMA is efficient and adaptive to the dynamics of the network.

REFERENCES

[1] J. C. Lin and S. Paul, "RMTP: A Reliable Multicast Transport Protocol," Proceedings of IEEE INFOCOM '96, March 1996.
[2] H. W. Holbrook, S. K. Singhal, and D. R. Cheriton, "Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation," Proceedings of ACM SIGCOMM '95, September 1995.
[3] A. Koifman and S. Zabele, "RAMP: A Reliable Adaptive Multicast Protocol," Fifteenth Annual Joint Conference of the IEEE Computer and Communication Societies, March 1996.
[4] S. Floyd, V. Jacobson, C-G Liu, S. McCanne, and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing," IEEE/ACM Transactions on Networking, November 1996.
[5] S. McCanne and V. Jacobson, "Receiver-driven Layered Multicast," Proceedings of ACM SIGCOMM '96, August 1996.
[6] R. Talpade and M. H. Ammar, "Single Connection Emulation (SCE): An Architecture for Providing a Reliable Multicast Transport Service," Proceedings of the 15th IEEE International Conference on Distributed Computing Systems, June 1995.
[7] S. Deering, "Host extensions for IP multicasting," RFC 1112, August 1989.
[8] W. R. Stevens, TCP/IP Illustrated Volume 1, Addison Wesley, March 1996.
[9] D. Dwyer, J. Liu, S. Ha, and V. Bharghavan, "Transport Layer Adaptation For Supporting Multimedia Flows in the Internet," Proceedings, IEEE Conference on Multimedia Computing Systems '98, June 1998.
[10] B. N. Levine and J. J. Garcia-Luna-Aceves, "A Comparison of Reliable Multicast Protocols," Multimedia Systems (ACM/Springer), August 1998.
[11] S. Kasera, J. Kurose, and D. Towsley, "A Comparison of Server-Based and Receiver-Based Local Recovery Approaches for Scalable Reliable Multicast," Proceedings of IEEE INFOCOM '98, June 1997.
[12] J. Reynolds and J. Postel, "Assigned Numbers," RFC 1700, October 1994.
[13] S. McCanne and V. Jacobson, "The BSD Packet Filter: A New Architecture for User-level Packet Capture," Proceedings of the 1993 Winter USENIX Technical Conference, January 1993.
[14] J. Vos and W. Konijnenberg, "Linux firewall facilities for kernel-level packet screening," June 1996.