Abstract

This paper presents the results from a detailed, experimental study of OSPF, an intra-domain routing protocol, running on a mid-size regional Internet service provider. Using multiple, distributed probes running custom monitoring tools, we collected continuous protocol information for a full year. We use this data to analyze the health of the network including the amount, source, duration and periodicity of routing instability. We found that information from external routing protocols produces significant levels of instability within OSPF. We also examine the evolution of the routing topology over time, showing that short term changes are incremental and that the long term trend shows constant change. Finally, we present a set of detailed investigations into several large scale anomalies. These anomalies demonstrate the significant impact external routing protocols have on OSPF. In addition, they highlight the need for new network management tools that can incorporate information from routing protocols.

1. Introduction

Routing protocols are a critical component of the Internet. Their goal is to ensure that normal traffic is able to efficiently make it from source to destination. While routing protocols are based on simple theories, designing a protocol that functions correctly and efficiently under real loads is difficult. In addition, testing the protocols under realistic scenarios is often impossible. Simulations of large scale routing are either too complex to perform or too simplistic to accurately predict the challenges encountered on real networks. Large scale testbeds don’t exist, and monitoring the protocol on production systems is also difficult. This paper presents the results of a detailed study into just this kind of real-world system.

This study collected routing protocol data from MichNet, a mid-sized regional Internet service provider covering the state of Michigan. Data was collected from four geographically distributed probe machines running custom monitoring software. This year-long effort collected information continuously from MichNet’s intra-domain routing protocol, OSPF. We used this data to analyze the performance of OSPF on a production network and discovered surprising anomalies. There is a long history of studies looking for these kinds of problems in routing protocols. Detailed studies of routing on the Internet have uncovered significant problems with the performance of the Border Gateway Protocol (BGP) [4; 7; 8; 9; 10]. BGP, an inter-domain routing protocol, interconnects the autonomous systems that comprise the Internet. As such, any significant problems with BGP can affect routing on the Internet as a whole. Unlike BGP, the work on intra-domain routing protocols such as OSPF and IS-IS has mainly focused on their behavior in controlled environments [2; 3; 13; 15]. Distinguished by the fact that they are run within an autonomous system, intra-domain routing protocols have a significant impact on the performance of the local network. In addition, because inter-domain and intra-domain routing protocols are often interdependent, problems with one can affect the performance of the other. While these studies based on simulations and small scale testbeds have identified important issues, they still fall short of identifying the issues that affect these protocols on production networks. Our work, however, has focused on understanding the performance of intra-domain protocols under real-world conditions. We attempt to highlight several anomalies of OSPF that are not seen under controlled conditions.

Our work also complements the few existing studies that have looked at the behavior of intra-domain routing protocols on production networks. Our work extends these studies by providing year-long analysis, corroborating others’ results and introducing previously unseen results. A study of Qwest’s network focused on issues affecting the convergence time of IS-IS [1]. This study identified single, flapping links as the predominant source of instability. It also made the theoretical case for using an incremental short-
[Plots not reproduced: updates per minute by update type (Modified, Refresh, Down), April through March.]
Refresh: An LSA that contains no changes. These are used by the protocol to ensure consistency across all routers.

Down: An LSA that has been explicitly removed from the database. The age of the LSA is set to MAXAGE and announced to all routers, ensuring that this path is quickly removed from the database. It usually represents a failed link.

Up: An LSA that was seen previously but was not currently active. This usually represents a link coming back up after a failure.

Modified: An LSA that has changed a parameter not related to up or down status. This usually represents something simple like a metric change. For type 1 and 2 LSAs this often represents a change with one of the attached links.

Timeout: A marker in the data representing an LSA that has timed out in our local database. This LSA is not ever sent on the network, instead it is used to recognize LSAs that timed out before ever being explicitly marked down.

These labels are also used to describe changes in individual links in type 1 LSAs and the set of attached routers in type 2 LSAs. It’s important to note that there are two sources of updates that we add into the data. Since LSAs are removed from the database after 60 minutes, we add an explicit update with a timeout label into the data. Since normal routers will broadcast the fact that they’ve timed out an LSA, this update is usually followed by the down update produced by another router. In addition, since type 1 link deletions and type 2 router membership deletions are represented by re-announcing the LSA with the removed link or router omitted, we add an explicit entry into the LSA labeled down to keep track of these implicit announcements.

Figures 2 and 3 show the year-long announcements broken down by update type. We have omitted most of the update types because they don’t produce enough data to show up on the graph. In addition, the type 5 up announcements mirror the type 5 down announcements, so they have been removed for clarity. It is important to first point out that an individual link changing state results in changes for type 5 LSAs at the LSA level, while type 1 LSAs have more than one link per LSA. This is why the type 1 data is dominated by modified announcements with very few down/up announcements. The modified label in this case represents something changing in the LSA; in particular one of the links changing state. The up and down labels on the other hand represent the entire router going up or down.

As we saw before, the baseline amount of refresh traffic represents the stable operation of the network. For the most part, this is the predominant source of announcements. However, for the type 1 announcements, the level of modified events exceeds the level of refreshes on several occasions. For both LSA types however, the predominant periods of instability can be traced back to a single link. Due to the topology of the network there are far more type 5 links than type 1 links. This means that a single type 1 link flapping produces a much higher relative level of traffic. Several of these spikes represent significant periods of instability. For example, during the end of May into the beginning of June we see about ten updates per minute (five updates/minute for both down and up events) from a single customer connection. Not only does the amount of up/down announcements increase during this period, but the refresh announcements are elevated too. As we will show later, this is due to the level of aggregation changing, resulting in more announced routes. Not only is the instability producing more load on the network from down/up events, it also introduces more announced routes, further aggravating the problem.

While these graphs give a good understanding of the characteristics of the updates over time, we need more detail to better understand the traffic breakdown. In particular, we want to look at some of the numbers that are significant but don’t show up on the graphs. Table 1 shows the distribution of LSA updates by type and by status. There are several interesting points in this table that we couldn’t see in the graphs. First, the number of new events gives an
approximate count for the number of entries for each type. For example, there were 56 unique routers on MichNet over the course of the year. Not all were necessarily active at one

Figure 4. Number of links per router. [Plot not reproduced; horizontal axis: Router ID.]

that arrive with a new type 1 LSA are classified as new, while never before seen links that are from a known router are classified as up. While we classified 421 links as new, we saw a total of 750 announced links.

Another interesting result is the high level of type 5 mod-

[Plots not reproduced. Axis names and legends recoverable from the page: Percentage of updates; Down and Up against Period (min); 30 seconds, 1 minute, 2 minutes and 5 minutes against Duration of instability (seconds).]
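The update labels defined earlier (new, refresh, modified, down, up and the synthetic timeout marker) can be applied to a captured LSA stream as follows. This is an added example, not the study's monitoring code; the record layout, the single `body` string standing in for LSA contents, and the sample stream are assumptions made for illustration.

```python
from dataclasses import dataclass

MAXAGE = 3600  # seconds: the 60-minute LSA lifetime the text relies on

@dataclass(frozen=True)
class Update:      # constructed record layout, not the probes' own format
    t: int         # arrival time at the probe, in seconds
    key: str       # LSA identity (type, link-state ID, advertising router)
    age: int       # LS age carried in the announcement
    body: str      # remaining content (metric, attached links, ...)

def classify(updates):
    """Return [(time, key, label)], inserting synthetic timeout markers."""
    active = {}        # key -> (last_seen, body) for LSAs in the local database
    inactive = set()   # keys seen earlier that are not currently active
    out = []
    for u in sorted(updates, key=lambda x: x.t):
        # LSAs not re-announced within MAXAGE have aged out of the local copy.
        expired = sorted((seen + MAXAGE, k) for k, (seen, _) in active.items()
                         if seen + MAXAGE <= u.t)
        for when, k in expired:
            out.append((when, k, "timeout"))
            del active[k]
            inactive.add(k)
        if u.age >= MAXAGE:            # explicit removal announced at MAXAGE
            label = "down"
            active.pop(u.key, None)
            inactive.add(u.key)
        else:
            if u.key in active:
                same = active[u.key][1] == u.body
                label = "refresh" if same else "modified"
            elif u.key in inactive:
                label = "up"           # seen before, not active until now
            else:
                label = "new"
            inactive.discard(u.key)
            active[u.key] = (u.t, u.body)
        out.append((u.t, u.key, label))
    return out

# Constructed sample stream for two LSAs, "A" and "B".
SAMPLE = [
    Update(0, "A", 1, "metric=10"), Update(1800, "A", 1, "metric=10"),
    Update(2000, "A", 1, "metric=20"), Update(2100, "A", MAXAGE, "metric=20"),
    Update(2400, "A", 1, "metric=20"), Update(2500, "B", 1, "metric=5"),
    Update(7000, "B", 1, "metric=5"), Update(7100, "A", MAXAGE, "metric=20"),
]

if __name__ == "__main__":
    for row in classify(SAMPLE):
        print(row)
```

In the sample, the timeout marker for "A" at 6000 is followed by an explicit down at 7100, the ordering the text describes when another router announces the aged-out LSA.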
the large period of instability in the later part of May. As we mentioned before, this is caused by a single customer connection. The instability, caused by the BGP connection to the customer, was injected without filtering directly into OSPF. Router number 52, on the other hand, announces the largest single source of instability. An error in the router configuration caused the router to announce a prefix assigned to France Telecom. Without any real underlying connection, this route flapped unhindered for the period of a week. The explanations for the other routers are similar, and can be found in [16].

These results show that several routers contribute disproportionately to the instability in the network. This is due to individual links on the router and does not represent a problem systemic to the router in question.

4.2.2. Length of instability

We now know that the instability we see on the network is usually due to a single source. The next issue we want to address is the timing of these periods of instability. First we examine how long a particular link stays down or up; then we investigate how long the periods of instability themselves last.

To understand how long links spent in the down and up states, we measured the amount of time each link spent in each state. We then plotted the cumulative distribution function by number of periods over the length of these periods. Figure 6 shows the graph for the type 5 LSAs. The type 1 link announcements are similar, with a higher percentage of short periods. On an ideal network, we would hope that the links would stay down for relatively short periods of time, and up for significantly longer. This data shows that there are many periods when the prefix or link was only up or down for a very short period of time: 53% of the type 5 periods lasted less than five minutes, while 75% of the type 1 link periods lasted less than 2.5 minutes. While the low periods for the down events might suggest stability, this concentration of very short periods for both down and up events indicates the links are flapping, or oscillating quickly between the down and up states. This behavior is definitely undesirable. On the other hand, the distribution is heavy tailed. About 1%, or 3790, of the type 5 up periods and 2%, or 1098, of the type 1 up periods lasted longer than thirty days. On the other hand, about 3020 or 0.8% of the type five down periods and 211 or 0.5% of the type 1 down periods lasted longer than 30 days. One important consideration is that we are looking at the distribution by period count, not by length. In other words, a three month period is counted the same as a five minute period. So, while this data shows significant amounts of traffic in the network, individual links are predominantly stable.

Looking at the amount of time each link stayed up or down gives us an idea of how much flapping is seen on the network. We also want to understand how long these periods of instability last. The next graph shows the length of these periods of instability. We first collect all periods less than a given threshold. We then string adjacent periods together to determine how long the instability lasted. Figure 7 shows the cumulative distribution of the periods of instability for different thresholds. We strung together sequences of up/down events that were 30 seconds, 1, 2, and 5 minutes apart. The result is that most of the periods of instability are relatively short; 80% of the five minute sequences last for less than two minutes. More significant are the top few that last up to 5.5 days. The most prominent of these is the France Telecom link we discussed earlier which continuously changes state at least once every five minutes for 5.5 days. In fact, this is the largest single source of instability on the network. This is indicative of a larger problem caused by mistyped prefixes causing single routers to inject random prefixes into OSPF for short periods of time. The second longest period of instability is a collection of routes that flap every five minutes for 2.5 days. Unlike the previous period, this one appears to be composed of about ten legitimate customer prefixes. They do however, exhibit the behavior we discussed before of a change in aggregation level. During the instability we see more prefixes announced for
Figure 8. Shortest path tree configurations.
Figure 9. Shortest path tree differences.
[Plots not reproduced. Figures 8 and 9 span April through March. Two further panels follow: one covering 20-May to 27-May with the axis "Time (Dates are labeled at midnight)", one covering April through March.]
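The chaining step from Section 4.2.2 (collect periods shorter than a threshold, string adjacent ones together, then take the distribution of the resulting lengths by count) can be written as below. This is an added example, not the authors' analysis code; the function names and the sample change times are assumptions, and the thresholds are the four the text lists.

```python
THRESHOLDS = (30, 60, 120, 300)   # seconds: 30 s, 1, 2 and 5 minutes

def instability_runs(change_times, threshold):
    """Chain state changes spaced less than `threshold` seconds apart.

    Returns [(start, end)] for every run containing at least two changes.
    """
    runs, start, prev = [], None, None
    for t in sorted(change_times):
        if prev is not None and t - prev < threshold:
            if start is None:
                start = prev          # the run begins at the earlier change
        elif start is not None:
            runs.append((start, prev))  # gap too long: close the open run
            start = None
        prev = t
    if start is not None:
        runs.append((start, prev))
    return runs

def durations_by_threshold(change_times, thresholds=THRESHOLDS):
    """Length in seconds of each period of instability, per threshold."""
    return {th: [end - begin
                 for begin, end in instability_runs(change_times, th)]
            for th in thresholds}

def fraction_shorter(durations, limit):
    """Empirical CDF by count: share of runs lasting less than `limit` s."""
    if not durations:
        return 0.0
    return sum(d < limit for d in durations) / len(durations)

# Constructed up/down change times (seconds) for a single link.
SAMPLE_CHANGES = [0, 20, 45, 60, 400, 500, 650, 700, 5000, 5010]

if __name__ == "__main__":
    for th, ds in durations_by_threshold(SAMPLE_CHANGES).items():
        print(th, ds, round(fraction_shorter(ds, 120), 2))
```

Raising the threshold merges neighbouring bursts into longer runs, which is why each threshold yields a separate distribution; counting runs rather than summing their lengths matches the by-count distribution the text describes.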
4.4. Detailed analysis

The previous sections explored the aggregate, coarse view of the network over the period of a year. From this perspective, we can see that there are periods of increased instability, but we have little understanding of the specifics of these anomalies. In addition, the aggregate view can completely miss significant anomalies that play an important role in the day to day performance of the network. This section focuses on two such anomalies. The first highlights a significant period of flapping caused by a single physical link. The second anomaly involves a single router sporadically injecting random routes into OSPF. These specific incidents highlight the general lack of comprehensive network monitoring tools available to network operators as well as the amount of instability external routes can introduce into OSPF.

4.4.1. Periodic Instability

The most noticeable period of instability can be seen in figure 2 during the end of May and beginning of June. Looking closer at the instability we can see that there are significantly more updates during this period than normal. In addition, the updates appear to be periodic. As you can see in figure 10, this instability corresponds with the normal human waking cycles for the local time zone. While this kind of instability has been seen in BGP behavior before [9], this is the only instance we’ve found in the OSPF data. Talking to the network operations staff, the source of this failure was indeed BGP. During this time period, the particular customer was upgrading their physical connection to MichNet. The prefixes for this customer are maintained using a BGP connection and are then injected into OSPF. However, there are no filters placed on this BGP data, so when BGP failed, all of the instability was directly injected into OSPF.

Despite the fact that this data appears to be caused by a single link, we see a large number of prefixes related to this instability, all allocated to the same customer. Surprisingly the number of prefixes announced for this customer triples during this period of instability. It appears that these extra prefixes are due to a change in the level of aggregation. Rather than aggregating several prefixes into one shorter prefix before injecting them into OSPF, the router was announcing the longer prefixes directly into OSPF. This has serious implications about the stability of the network. Not only can the number of updates increase during periods of instability, but the number of LSAs can increase due to aggregation failure. As with the periodicity, this instability is most likely caused by BGP failures being directly injected into OSPF without any filtering.

4.4.2. Aggregation Breakdown

The previous anomaly leads to another interesting observation. Router 53 from figure 4 shows even more variability as seen in figure 11. Unlike the previous case, the increased number of routes for this router never persists for longer than three days. In addition, the increased levels are not localized to a single period. Looking at the trouble ticket system for MichNet, these increased routes correspond to recorded problems with this router. In addition, rather than adding subsets of existing routes, this router announces additional prefixes from different parts of the network. For example, announcing prefixes assigned to a customer from the western side of the state when the router is on the eastern side. In addition, as is the case with the France Telecom route, some of these routes appear to be completely unrelated to MichNet. Talking to the operations staff, it appears that all of these additional prefixes are caused by configuration errors. The routes from across the state are topologically neighbors to the announcing routers due to ATM links that span the state. In addition, the France Telecom link has a prefix very similar to ones assigned to MichNet. It appears that a simple typing mistake caused this prefix to be injected into OSPF without any real connection to an external protocol. While it is entirely possible that these extra routes would cause some of the problems seen in the