
Free-Scaling Your Data Center

Paper #79, 14 pages

Categories and Subject Descriptors


C.2.3 [Computer-Communication Networks]: Network Operations

General Terms

1. EFFICIENT SOURCE ROUTING


Here we propose the Efficient Source Routing (ESR) mechanism, which exploits the favorable properties of the Scafida topology while addressing the routing challenge posed by its heterogeneous structure. ESR applies the paradigm of source routing for the following reasons. First, source routing offers tight performance management and multipath capability for load balancing. Second, a data center is typically operated by a single organization; therefore, there are no prohibitive, Internet-related security problems, and the global topology is known to the operator. Third, the topology of a data center is static, especially compared to the Internet. Driven by the pressure to operate data centers cost-effectively, a profound goal is to use cheap, commodity network equipment within the data center. Thus, a routing mechanism that requires only switches with moderate resources, e.g., memory and CPU, is highly welcome. In the case of source routing, the computational burden at intermediate switches is drastically reduced; switches forward packets based on pattern matching without having to calculate the next hop. In addition, ESR supports multicast, and it requires the propagation of routing information only in case of network failures and data center expansion, resulting in low control traffic volume. In this section, we first introduce the applied addressing scheme, second, we describe the path selection procedure and error recovery, and finally, we evaluate routing performance by means of simulation.

1.1 Addressing and Forwarding: Bloom Filters

Since Scafida networks have stochastically heterogeneous topologies, routing on traditional IP addresses would be inefficient in them. As node addresses cannot be aggregated properly, routing tables would be large, which contradicts the goal of applying cost-effective, commodity network equipment. Accordingly, ESR addressing is based on Bloom filters [2]. Unique identifiers are generated for every link in the topology; these are the Bloom IDs of the links. The same set of Bloom IDs is stored by every server in the data center. In order to create a Bloom address for a target server, the identifiers of all links along the chosen path, from source to target, are summed up using the bitwise OR operator. The algorithm of Bloom address computation is presented in Figure 1. The resulting Bloom address, stored in the packet header, contains the routing information for the packet, i.e., the used links are implicitly stored. Bloom addresses have local validity, meaning that a demand can be routed to the specific target only from the source server where the Bloom address was generated. Note that the address size is fixed for all path lengths.

Input: P = {e1, ..., en} the links of the path
       B a mapping between the links and their Bloom identifiers
Algorithm
1  a = 0...0           // the initial address
2  for i = 1, ..., n do
3      a = a | B[ei]   // bitwise OR operation

Figure 1: Bloom address computation
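
To make the addressing concrete, the following Python sketch generates link Bloom IDs and computes a path's Bloom address as in Figure 1. The ID generator (k random bits set in an m-bit word, with parameters borrowed from Table 1's 500-server row) is our assumption; the paper only requires that the IDs be unique.

import random

M_BITS = 59  # Bloom ID length (Table 1, 500-server row)
K_BITS = 8   # bits set per ID ("used bits")

def make_bloom_id(rng=random):
    """One link ID: an M_BITS-wide word with K_BITS random bits set.
    A plausible generator; the paper only requires unique IDs."""
    bid = 0
    for b in rng.sample(range(M_BITS), K_BITS):
        bid |= 1 << b
    return bid

def bloom_address(path_links, bloom_ids):
    """Figure 1: OR together the Bloom IDs of every link on the path."""
    a = 0
    for e in path_links:
        a |= bloom_ids[e]
    return a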

As the packet traverses the network, with its routing information coded into the Bloom address, intermediate switches inspect it. Each switch compares the address of the packet with the IDs of its outgoing interfaces; the packet is forwarded on the matching links (except for the incoming link, to avoid loops). We present the forwarding procedure of switches in Figure 2. Note that the forwarding procedure inherently supports multicast; the source node just has to code all desired link IDs into the Bloom address. Also, flow entries can be created to speed up the forwarding of packets of the same flow. Network switches execute the bitwise AND operation only, which has modest resource requirements. This is in line with the implications of recent data center traffic measurements [1], i.e., due to the high arrival rate of the flows, forwarding/routing decisions have to be made quickly.

Input: L = {e1, ..., en} the direct links of the switch
       ec the incoming link of the packet
       B a mapping between the links and their Bloom identifiers
       p the Bloom address of the packet
ForwardPacket(p)
1  for i = 1, ..., n do
2      if (i ≠ c) and (B[ei] = B[ei] & p) then   // bitwise AND operation
3          forward the packet on link i

Figure 2: Forwarding procedure
False positive matches can occur when Bloom filters are applied. Should such a false positive arise at a switch, the packet would also be forwarded on an additional link, creating unnecessary overhead, and it may also lead to forwarding loops. The probability of such an occurrence depends on the size of the identifier space; i.e., if the links are labeled with larger identifiers, the probability of a false positive match decreases. More specifically, it can be shown analytically that the probability of a false positive match can be expressed as

P = (1 − (1 − 1/m)^{nk})^k    (1)

where m denotes the number of bits of the identifiers, k denotes the number of utilized bits, and n denotes the number of elements in the Bloom addresses. Based on this formula, we show applicable Bloom ID lengths for different data center sizes in Table 1, assuming that the probability of false positive matches must not exceed 10^{-4}.
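Equation (1) is straightforward to evaluate numerically; the helper below implements it verbatim (reading n as the number of link IDs OR-ed into an address, per the text above):

def false_positive_probability(m, k, n):
    """Equation (1): chance that all k bits of a non-member link ID are
    covered after n identifiers are OR-ed into an m-bit Bloom address."""
    return (1.0 - (1.0 - 1.0 / m) ** (n * k)) ** k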

Table 1: Preferred Bloom filter parameters for different data center sizes; the probability of a false positive match is below 10^{-4}

DC size     500   1000   5000   10000
Diameter      4      5      6       7
ID length    59     73     87     102
Used bits     8      9     10       9


Although even a relatively small Bloom filter yields a small false positive probability, Scafida benefits from false-positive-free operation by exclusively choosing false-positive-free paths, similarly to [6]. Clearly, Bloom filters can efficiently accommodate the variable path lengths of Scafida data centers, which are not present in other, symmetric state-of-the-art data center structures. Bloom-based addressing also complements source routing and allows for swift forwarding decisions at switches. In addition, the application of Bloom filters offers built-in multicast support. Multicast emerges as a crucial routing-level capability for data centers, since numerous cloud applications include routines where the same data have to be shared with multiple other servers (1-to-n pattern), MapReduce [3] being the best-known example. In Scafida, creating multicast packets is as easy as creating unicast packets; the Bloom IDs of all desired links are simply incorporated into a single Bloom address.

1.2 Path Computation

The Scafida structure is based on scale-free networks; thus, the path computation mechanism of ESR builds on a theoretical concept proposed for the special structure of scale-free networks. The efficient routing algorithm [7] was designed for scale-free networks generated with the Barabási–Albert method. It avoids overloading high-degree network nodes (so-called hubs) by weighting links with the degree of the nodes; the weighting factor is denoted by β. Accordingly, traffic load is spread across the network, resulting in improved available bandwidth and hence decreased oversubscription, a key metric in today's data center networks. In order to exploit and adapt to the properties of the Scafida structure, we slightly modify and extend the efficient routing mechanism. First, we design ESR as a source routing scheme, where the computation of end-to-end paths and Bloom addresses is carried out exclusively by the servers. This modification facilitates the use of commodity switches inside the data center by pushing computationally intensive tasks to the edges, while switches only perform simple bitwise operations. Second, ESR determines not only the shortest efficient paths between source and destination based on the weighted topology, but also derives paths that are only reasonably longer than the shortest ones. The path length scaling factor (stretch) is denoted by σ. Based on σ, ESR calculates multiple paths between the end-points; some paths may have an additional hop compared to the shortest efficient paths, but this does not affect routing performance noticeably. This extension ensures that ESR derives multiple paths even if the data center is built out of identical switches, where the weighting of the original efficient routing alone would not spread the traffic load. Finally, two paths are picked randomly from the feasible paths: one for routing the packets of the flow and one as a backup that can be used immediately in case of failures. All packets of a given flow are routed on the same path; however, the load of multiple flows is spread over multiple paths. Accordingly, ESR achieves adequate load balancing in Scafida data centers. The ESR algorithm is defined in Figure 3. Note that the size of routing tables is manageable, as each server only has to store entries for the other servers to which it has an active flow (of course, larger caches can be maintained if resources allow).

Input: G(V, E) the Scafida network
       β, σ parameters
       dv the degree of node v
       s the source of the demand
       t the target of the demand
Algorithm
1  for e = (n1, n2) ∈ E do
2      we = (dn1)^β              // weight of the link
3  sp = shortestPath(s, t, w)
4  c = cost of sp
5  feasiblePaths = []
6  for p between s and t do
7      cp = cost of p
8      if cp ≤ σ · c then
9          feasiblePaths ← p
10 ep = random(feasiblePaths)

Figure 3: Path computation in Efficient Source Routing
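
A compact executable reading of Figure 3, using networkx, is sketched below. The symmetric link weight (penalizing the higher-degree endpoint) and the hop cutoff that keeps the path enumeration tractable are our assumptions; the paper's algorithm weights a link by the degree of its end node and bounds path cost, not hop count.

import random
import networkx as nx

def esr_paths(G, s, t, beta=1.0, sigma=1.1, max_hops=8):
    """Sketch of Figure 3: weight links by degree**beta, keep every path
    whose cost is within sigma times the cheapest cost, then pick a
    primary and a backup path at random."""
    for u, v in G.edges():
        # Symmetric stand-in for the paper's degree-based link weight.
        G[u][v]["weight"] = max(G.degree[u], G.degree[v]) ** beta
    c = nx.shortest_path_length(G, s, t, weight="weight")
    feasible = []
    for p in nx.all_simple_paths(G, s, t, cutoff=max_hops):
        cost = sum(G[u][v]["weight"] for u, v in zip(p, p[1:]))
        if cost <= sigma * c:
            feasible.append(p)
    if not feasible:  # cutoff too tight to recover the shortest path
        feasible = [nx.shortest_path(G, s, t, weight="weight")]
    if len(feasible) == 1:
        return feasible[0], feasible[0]  # no distinct backup available
    return tuple(random.sample(feasible, 2))  # (primary, backup)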

A recent measurement study [1] revealed that data center traffic is dominated by very short flows. In such a dynamic environment, a source routing scheme that probes the network for good end-to-end paths when a new flow is initiated (e.g., BCube's BSR [5]) is ineffective; the results of such probes are probably no longer valid by the time packets start to traverse the chosen route. ESR does not require any kind of signaling during normal operation. Instead, ESR relies on its load-balancing-enabled path computation, and achieves good results. As a bonus, commodity switches do not have to respond to probes, hence saving valuable resources.

Regarding multicast, ESR determines the feasible paths for every target server and randomly picks a single path to be used; afterwards, the Bloom addresses of the picked paths are aggregated (link by link) to form the Bloom address of the multicast group. Although this method does not create the optimal Steiner tree of the multicast group (an NP-complete problem), the paths can be computed efficiently and used in a live environment. Potential overhead is also mitigated by the dominance of short flows in the traffic mix [1]. We discuss potential enhancements in Section ??.

We illustrate the Efficient Source Routing method in Figure 4. Note that only links of interest are shown (actual switch degrees in parentheses). Server A wants to send packets to server B; the ESR method is run at the server to determine the path on which the packets will traverse. Let us suppose that ESR is operated with β = 1 and σ = 1.1. In this case, the weighted length of the upper path is 16 + 4 + 4 + 16 = 40, while that of the lower path is 16 + 24 + 16 = 56. Accordingly, the upper path is the only feasible path, as the lower path is longer than 40 · 1.1 = 44. By selecting the upper path, the flow avoids those links where high utilization is more likely, as they are connected to switches with higher degrees. The impact of σ can be illustrated with σ = 1.4: in this case both paths are feasible, and therefore one is picked randomly to be utilized. On the contrary, in case of β = 0 the lower path is selected, as it has only four hops in contrast to the five hops of the upper path. Assume the latter scenario, hence the path A–3–4–5–B is chosen by the path computation procedure. Next, the Bloom IDs of the affected links are aggregated into the single Bloom address 011011. The packet first arrives at switch 3, where the in-packet Bloom address matches only the link that goes towards switch 4. Similarly, the Bloom address of the packet is compared to the link IDs at switch 4. As the packet only matches the 01000 pattern (other links not shown), it is forwarded towards switch 5. There, the address matches the link ID towards server B; therefore, the packet follows that link and arrives at server B. Now, if server A wants to send a multicast packet to both server B and server C, it has to compute the in-packet Bloom address accordingly. Let us assume that the path computation procedure selects the same path towards server B, and A–3–4–5–C for server C. As the link 5–C has the Bloom ID 10000, the resulting Bloom address will be 111011. This address matches both links depicted at switch 5, and the packet is delivered to both destination servers.

Figure 4: Illustration of the Efficient Source Routing mechanism; switches are numbered (degrees in parentheses) and Bloom link IDs are shown
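
The bitwise arithmetic of this example can be replayed in a few lines of Python. The 6-bit link IDs below are assumptions chosen to be consistent with the addresses printed in the text (only the ID of link 5–C is stated there):

# Hypothetical 6-bit Bloom IDs, consistent with the example's addresses.
ids = {
    ("A", 3): 0b000001,
    (3, 4):   0b000010,
    (4, 5):   0b001000,
    (5, "B"): 0b010000,
    (5, "C"): 0b100000,  # the "10000" ID of link 5-C from the text
}

unicast = 0
for link in [("A", 3), (3, 4), (4, 5), (5, "B")]:
    unicast |= ids[link]
print(f"{unicast:06b}")    # 011011 -- the unicast address of the example

multicast = unicast | ids[(5, "C")]
print(f"{multicast:06b}")  # 111011 -- matches both links at switch 5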

1.3 Failure Handling


The Scafida structure inherently tolerates network failures well, owing to its scale-free-network-inspired design, as shown in Section ??. At the routing level, network failures are handled as follows. The failure of a network link is detected by the two neighboring nodes based on the loss of connection at the link layer. The failure of a switch or server is detected as multiple link failures by all neighboring nodes. Upon detecting a failure, a node generates a network failure (NF) control packet and sends it on all of its active links. The control message is propagated in the network by a pruned broadcast method; accordingly, all servers become aware of the network failure. The active paths that are affected by the failure are switched to their backup paths, while new flows are routed based on the updated network topology using the ESR method.

The NF control message has the Bloom address 11...1; thus, it matches every possible Bloom ID at the switches. The source of the packet is set to the Bloom ID of the unavailable network link. We note that since the Bloom IDs of links are generated uniquely, false positive matches are solely a consequence of the aggregation of link IDs. The source addresses of the control messages are stored at every switch for a given, short time period (as soft state). An incoming NF packet is forwarded only if it is not present in the failed-links table; otherwise, the packet is dropped. An adequate timeout period for the failure soft states can be computed as the average propagation delay of the links multiplied by twice the diameter of the network. When a network failure is repaired, a network recovery (NR) control packet is broadcast in the network, notifying the servers about the renewed availability of the restored network equipment.

The proposed failure handling method does not have a significant impact on the performance of Scafida data centers. The number of links on which an NF message traverses equals the number of links of the topology, and the generated control traffic overhead is proportional to the number of failures in the network. Based on [4], the number of simultaneous network failures is usually below a few hundred even in very large data centers; thus, the aggregate load of the failure control messages is negligible compared to the throughput capability of the data center network.
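
The duplicate suppression of NF packets can be captured by a small soft-state table. The sketch below applies the timeout heuristic from the text; the class name and the use of wall-clock time are our assumptions:

import time

class FailedLinkTable:
    """Soft state for NF duplicate suppression at a switch (a sketch)."""

    def __init__(self, avg_link_delay, diameter):
        # Timeout heuristic from the text: average link propagation
        # delay multiplied by twice the network diameter.
        self.timeout = avg_link_delay * 2 * diameter
        self.expiry = {}  # failed link Bloom ID -> expiry timestamp

    def should_forward(self, nf_link_id):
        """True exactly once per failure within the timeout window:
        forward first sightings, drop duplicates (pruned broadcast)."""
        now = time.monotonic()
        self.expiry = {k: v for k, v in self.expiry.items() if v > now}
        if nf_link_id in self.expiry:
            return False
        self.expiry[nf_link_id] = now + self.timeout
        return True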

1.4 Routing Performance

We have evaluated the proposed ESR mechanism in a discrete event simulator written in Python. Traffic was simulated at the flow level, assuming fair sharing of link bandwidth among competing flows. Here, we investigate two different scenarios: the impact of the ESR weighting parameter β and ESR's fault tolerance. First, we created 1000- and 5000-server Scafida topologies with m = 2, pt0 = 2, and switch size distributions of (60,50,40,30,17) for the 1000-server and (300,250,200,150,85) for the 5000-server topology, the switches having 4, 8, 16, 24, or 48 ports. We simulated 1000 flows with exponentially distributed inter-arrival times (λ = 1000, to ensure significant competition among flows) and a lognormal flow size distribution (ln N(10, 2), taken from [1]). The 1000 flows were destined to a randomly chosen group of 100 servers to induce meaningful cross-traffic. Link capacities were uniformly set to 1 Gbit/s, and ESR's stretch bound was σ = 1.1. The results in Figure 5 are averaged over 20 simulation runs per parameter setting. It can be clearly seen that β has a profound effect on per-flow throughput. The β = 0 case is shortest path routing, and it is clearly inferior to efficient routing with a larger weighting parameter. At the same time, too large β values result in congestion and hence lower per-flow throughput. Note that the throughput values are higher in the larger topology because of the increased number of possible efficient paths.

Second, we investigated the fault tolerance properties of ESR (λ = 500 for inter-arrival times, β = 0.5, σ = 1.3). The results in Figure 6 show that ESR reacts well to network equipment failures: almost 90% of flows complete even at a 20% link failure ratio. Note that this result closely resembles that of Figure ??, although the slope of the curve is different. We can conclude that ESR exploits the inherent fault tolerance of the underlying Scafida topology in a near-optimal manner.

[Figure 5: The effect of the weighting factor β on per-flow throughput. Plot: average per-flow throughput [Mbit/s] versus the beta parameter (0 to 1), with curves for the 1000- and 5000-server topologies.]

[Figure 6: Flow abortion ratio as a function of network failures. Plot: percentage of failed demands versus percentage of link failures (0 to 20%), for ESR.]
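
For reproducibility, the flow-level fairness model can be sketched as classic progressive filling (max-min fairness). This is one standard way to realize the "fair sharing of link bandwidth" stated above, not necessarily the simulator's exact code:

def max_min_rates(flows, capacity):
    """Progressive filling: repeatedly find the most constrained link,
    give its unfrozen flows an equal share, and freeze them.
    flows: flow id -> list of links; capacity: link -> Mbit/s."""
    rate = {f: None for f in flows}
    remaining = dict(capacity)
    on_link = {l: [f for f in flows if l in flows[f]] for l in capacity}
    while any(r is None for r in rate.values()):
        shares = {
            l: remaining[l] / sum(1 for f in on_link[l] if rate[f] is None)
            for l in capacity if any(rate[f] is None for f in on_link[l])
        }
        bottleneck = min(shares, key=shares.get)
        share = shares[bottleneck]
        for f in on_link[bottleneck]:
            if rate[f] is None:
                rate[f] = share  # flow is bottlenecked at this link
                for l in flows[f]:
                    remaining[l] -= share
    return rate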

2. REFERENCES

[1] T. Benson, A. Akella, and D. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In IMC '10, pages 267-280, 2010.
[2] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4):485-509, 2004.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, San Francisco, 2004.
[4] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM '09, pages 51-62, 2009.
[5] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-Centric Network Architecture for Modular Data Centers. In SIGCOMM '09, pages 63-74, 2009.
[6] C. Rothenberg, C. Macapuna, F. Verdi, M. Magalhães, and A. Zahemszky. Data Center Networking with In-Packet Bloom Filters. In 28th Brazilian Symposium on Computer Networks (SBRC), Gramado, Brazil, May 2010.
[7] G. Yan, T. Zhou, B. Hu, Z. Fu, and B. Wang. Efficient Routing on Complex Networks. Physical Review E, 73(4):046108, 2006.
