
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 12, NO.

6, AUGUST 1994


Fault Management Tools for a Cooperative and Decentralized Network Operations Environment
Ewerton L. Madruga and Liane M. R. Tarouco

Abstract-Some institutions do not have centralized computer network operations. In the case of UFRGS, some domains have their own local group of people in charge of such tasks and other domains do not. This work describes the university's network approach, as well as the fault management tools that support it. One of them, the CINEMA Alert System, analyzes the network after polling entities and generates alerts when required. Another, the CINEMA Trouble Ticket System, helps the decentralized operations staff cooperate during network failure recovery. The main features of the tools, and the organization of their software modules into a cooperative integrated network management environment, are presented.

Fig. 1. The topology of UFRGS network.

I. INTRODUCTION

The history of the UFRGS network is not different from that of many academic networks worldwide. Before a formal backbone was set up, there were many LAN-based subnetworks spread over three campuses, each with its own connection to the national academic network. Operations were carried out at each site in an independent fashion, since each site had a local manager responsible for providing the best possible service to department users. In the case of major local problems, informal discussion among managers and network experts used to take place, most commonly by e-mail. After the UFRGS network backbone finally connected the previously established subnetworks, they became domains of well-defined scope. Fig. 1 illustrates the topology of the network. Nowadays, equipment ranges from small PC and RISC workstations to a CRAY Y-MP/2E supercomputer. The network community soon realized the need for an operations center which had to support such a peculiar organization. To deal with this environment, a proposal for network management at UFRGS was designed. It is referred to as the Cooperative Integrated Network Management (CINEMA) environment, a cooperative and integrated way to manage the TCP/IP UFRGS network. The university network community believes that this approach is more responsive to user needs, because it proposes Help Desks closer to user sites and domain managers supporting all activities. Since domain managers are more aware of the user

environment than in a centralized approach, it is expected that the problems will be solved faster. Among CINEMA network operations support tools, there is an application for trouble tracking that helps operators to cooperate with each other every time faults and problems appear. There is also an alert system that helps the operations staff to detect critical situations by monitoring specific indicators around the network. The alert system can create tickets automatically with the application programming interface provided by the trouble-tracking system.
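As a concrete illustration of this integration, the sketch below shows how an alert system might open a ticket through a trouble-tracking API. The paper does not publish the actual CINEMA interface, so every name here (TicketSystem, open_ticket, on_alert) is a hypothetical stand-in:

```python
# Hypothetical sketch of the kind of programmatic interface described in
# the text: an alert system opening a ticket automatically. All names are
# invented for illustration; the paper does not give the actual API.

class TicketSystem:
    """Minimal in-memory stand-in for the CINEMA ticket server."""
    def __init__(self):
        self._tickets = {}
        self._next_id = 1

    def open_ticket(self, brief, domain, opened_by):
        tid = self._next_id
        self._next_id += 1
        self._tickets[tid] = {
            "id": tid, "status": "open", "brief": brief,
            "domain": domain, "opened_by": opened_by, "notes": [],
        }
        return tid

    def add_note(self, tid, text):
        self._tickets[tid]["notes"].append(text)

def on_alert(ticket_system, alert):
    """What an alert system might do when a repair action is required."""
    return ticket_system.open_ticket(
        brief=alert["description"], domain=alert["domain"],
        opened_by="alert-system")

ts = TicketSystem()
tid = on_alert(ts, {"description": "high response time", "domain": "inf.ufrgs"})
print(tid)                          # first ticket id: 1
print(ts._tickets[tid]["status"])   # open
```

The point is only that a machine client uses the same operations (open, note, close) as a human operator, which is what makes the integration possible.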

Manuscript received July 19, 1991; revised July 24, 1991. E. L. Madruga is with the University of Caxias do Sul, Brazil. L. M. R. Tarouco is with the Computer Sciences Post Graduation Programme (CPGCC), Institute of Informatics (II), Federal University of Rio Grande do Sul (UFRGS), P. Alegre, Brazil. IEEE Log Number 9401465. 0733-8716/94$04.00 © 1994 IEEE.

II. CINEMA-A COOPERATIVE INTEGRATED NETWORK MANAGEMENT ENVIRONMENT

As defined by [6], the general goal of any Network Operations Center (NOC) is to provide a level of consistent network service to its user community. Furthermore, in the UFRGS case, the NOC has to cope with many established domains and with problems coming from these domains, as well as from the backbone that connects them. It is important to notice that the network is continuously expanding, and not only will domains with well-structured management be added, but also departments with user groups lacking such organization. Given this context, as suggested by [10], faults are dispatched to the NOC and solved on three levels.

First Level: Misunderstood procedures, user software, or equipment setup parameters. This type of problem is isolated and repaired immediately. Sometimes a visit to the user is needed.

Second Level: Failures of network components, in either hardware, software, or application. A visit by an external or university technician is probably required.

Third Level: Multiple-component failures are, in general, intermittent and not easily isolated. Such problems require specific monitoring tools and experts in data analysis as well.

Aimed at facing problems on three levels, the network operations center is organized as illustrated in Fig. 2. There are three Help Desks acting as first contact points for user problem reports in each of the three campuses. They must solve first-level problems, which are expected to be 70-80% of the problems observed. In this way, Help Desk staff act as NOC front-end operators, repairing as many first-level faults as possible close to the problem sources. A Help Desk is the main contact for departments without engineers or experts responsible for a locally organized management function.

Fig. 2. The UFRGS NOC organization: Help Desks, domain managers, and a single TT base.

When a problem cannot be properly diagnosed at the Help Desk level, information about it is registered and forwarded to Service Control staff, domain managers, and/or experts. Here is where the cooperative approach takes place. Messages warning NOC staff of a new problem to be solved are generated. If the problem is really urgent, the Help Desk calls Service Control to report the situation and request immediate help. The urgent problem is then traced, optionally with the help of some of the domain managers. All relevant intermediate conclusions reached during the tracing process are registered. If the problem turns out to be at least of the second level, all information about it is registered in a trouble database.

Besides supervising their domains, with the same tasks as the ones performed by the Help Desks, domain managers control network components and help solve problems of the second and third levels, using their expertise to do so. And when domain managers, Service Control staff, and experts team up to help whoever is in need of help, a troubleshooting council is formed, and therefore cooperative network management takes place. At the present time, electronic mail is being used as the instrument for the exchange of opinions.

Each of the council's members has access to the ticket system through a local client software module. This local module allows members to inspect the notes on a ticket, tracing its history, and to contribute more information. A domain manager, for instance, can check equipment in his scope which is not able to answer pollings (ping, SNMP GET operations, etc.) and update the ticket based on his or her impressions. The other part of the UFRGS NOC organization is the Service Control. This is the NOC's staff responsible for coordinating operations activities. The ticket priority policy choice, NOC service quality control, and network performance analysis are among Service Control's main duties.

III. THE CINEMA TROUBLE TICKET SYSTEM

Given such a distributed organization for network operations, it is impossible for people to keep track of problems in the long run. A tool is required to help NOC members in this task. The so-called trouble ticket system allows any NOC operator, either from a Help Desk, a domain, or Service Control, to pick up an opened ticket and work on it. Since many common problems cannot be solved in a few minutes, and perhaps the repairing process depends on an expert or technician not available at the moment, notes are made on a ticket every time some important event comes around, either to inform the person responsible for the ticket about the new status or to support later work on the ticket.

There are many interesting trouble ticket systems which contributed to the CINEMA approach. A major reference on the subject is RFC 1297 [3]. Using his experience as a member of the Merit NOC, Johnson specifies a wish list for such systems. This RFC lists all desirable features for a trouble ticket system, one of the most important in our view being the need for integration with other tools. The CINEMA Trouble Ticket System supports integration with other applications through an application programming interface to be used by the designers of other tools.

The HDMS System [8] is a quite complete Help Desk application which features an interesting Problem Management subsystem. Its vendor database is certainly a good idea for trouble ticket systems like UFRGS CINEMA's which intend to generate Mean-Time-Between-Failures (MTBF) and Mean-Time-to-Repair (MTTR) reports. These indicators help the network operations center to identify what to inspect and what to do in the presence of repetitive problems. For instance, a lower MTBF indicator does not always mean that the vendor service is poor. If the MTBF of certain equipment is less than the one provided by its vendor, and equipment of the same model and vendor located in other parts of the network has better performance, then this equipment's environmental conditions should be checked. But associating MTBF/MTTR with vendors is also a major point when the quality of technical support is to be assured.

NEARnet's ticketing [4] has some important attributes in its ticket structure. The close code field, for instance, is valuable for statistics on problems and their associated solving procedures. It is a code that defines the sort of problem (bad network interface, software misconfiguration, etc.) concerning the ticket just closed. That code can be used as a reference later when new tickets are created, and when a survey of NOC activities (problems faced versus their solutions) is required.



Although both HDMS and NEARnet's ticketing systems have interesting functionalities, they only allow humans to interact with the system. If other network management applications are used to detect critical situations, they cannot open tickets to start maintenance procedures automatically. In cases where other tools need information on previous problems in a given network domain, that will not be possible either. Unlike these systems, the CINEMA Trouble Ticket System can be integrated with other applications through a well-defined application programming interface. At present, it is integrated with an Alert System, and both are part of the CINEMA environment as illustrated in Fig. 3.

Fig. 3. CINEMA trouble tracking.

The Alert System keeps polling the network to find candidate faults for problem tracking, much like the network monitoring system at the University of Illinois at Urbana-Champaign [7]. Each time a threshold is exceeded and connectivity problems or error conditions are found, an alert may be created. At the current stage of the project, the system is just smart enough to avoid alert bursts. To achieve that, the Alert System's module responsible for the generation of alerts uses a set of rules, information in the trouble ticket base, and information in the configuration base, as illustrated in Figs. 3 and 4.

A. Alerts and Tickets

Whenever trouble tracking is a concern, alerts and tickets are very close concepts. An alert is related to the search for problems and a ticket to the solving process. Depending on the type of the emerging alert, a ticket should be created or not. Alerts in the CINEMA Alert System are created only when a repairing action is required. The system's user interface does not get event dumps, but instead only messages when a relevant failure is around. To achieve this purpose, the alert system collects data periodically for later processing. The tool is designed at a first step to be used centrally at Service Control and at local management domains. Nevertheless, the alert system can be used, with some adjustments, as a remote monitor servicing different management domains in a similar fashion to that suggested by RFC 1271 (the RMON MIB) [12]. Some objects and monitored entities are defined for each domain, and only critical data are sent to a central monitoring station. A major result of the deployment of this strategy is lower network management traffic in the backbone.

A set of selected network components, a set of selected MIB object instances for each component as well as their sampling intervals, and a sampling window are given to the alert system to start monitoring the network. Basically, the components being monitored are routers and some file/mail servers. Common MIB objects monitored are (considering RFC 1213 [5]):

sysUpTime
ifAdminStatus and ifOperStatus
ifInErrors and ifOutErrors
ifInDiscards and ifOutDiscards
ifInUcastPkts and ifOutUcastPkts
ifInNUcastPkts and ifOutNUcastPkts

Since repetitive resets are always harmful to the network and to the equipment itself, sysUpTime should be monitored. When this object's value is found to restart too often in time, the monitored equipment is probably being reset due to malfunctioning, and this should be inspected. The actual status of each interface of a network entity (ifOperStatus) should match the status desired by the network administrator (ifAdminStatus). In the case of any difference, i.e., if a network interface is down when it was not supposed to be, a failure has occurred. A periodic check of these objects can show this discrepancy. The monitoring of both input and output discards, as well as unicast and nonunicast packets, can anticipate congestion conditions. Unicast and nonunicast counters help in baselining traffic patterns and detecting broadcast storms.

The sampling window is the amount of time in which the sampled values of an object instance are considered to detect any abnormalities. The window slides forward and discards the oldest sampled value when a new sampled value comes in. The new sampled value is compared to thresholds based on the average of the previous sampled values that lie in the window. If the lower or upper boundaries are crossed, an event is generated and an alert may follow. Then the cycle goes on. Some important practical aspects to consider about threshold calculation and sampling window sizing can be found in [7] and [1].

The CINEMA Alert System's module organization was designed using some ideas of the ALLINK Operations Coordinator (AOC) data flow [11]. In ALLINK, there are different stages that problem information passes through: a message comes from a subnetwork handler, is filtered, becomes an event which is processed and, if needed, becomes an alert, and is then finally displayed on the user interface.
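Two of the checks above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: a sysUpTime reset is inferred from the counter dropping between samples, and an interface fault from ifAdminStatus/ifOperStatus disagreement (up = 1, down = 2 in RFC 1213):

```python
def count_resets(sysuptime_samples):
    """Count probable reboots: sysUpTime only grows while the agent is up,
    so a sampled value lower than the previous one suggests a restart."""
    return sum(1 for prev, cur in zip(sysuptime_samples, sysuptime_samples[1:])
               if cur < prev)

def interface_fault(if_admin_status, if_oper_status):
    """A failure: the administrator wants the interface up (1), but its
    operational status disagrees."""
    return if_admin_status == 1 and if_oper_status != 1

print(count_resets([100, 4000, 9000, 50, 7000, 20]))  # 2 probable resets
print(interface_fault(1, 2))                          # True: down but wanted up
```

A real monitor would apply these checks once per sampling interval and feed any positives to the event processor rather than printing them.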



Fig. 4. Alert system data flow.

Fig. 5. The hysteresis mechanism to limit event bursts.
The CINEMA Alert System's approach is simpler, although very similar in this sense, and is presented in Fig. 4. The sampler module polls selected network elements in a given period, requesting values of a given set of MIB object instances for each component and keeping track of the instances' values during a sampling window. The data requested by the sampler are sent to the data analyzer module. Each new sampled value arriving at this module is compared to a lower (or falling) and to an upper (or rising) threshold. If either of these thresholds is crossed, an event may be generated.

The way the rising and falling thresholds are used to limit event generation is quite similar to the hysteresis mechanism defined by [12] for the RMON MIB Alarm Group. The difference lies in the situations in which a new event can be generated. In general terms, Waldbusser [12] states that a new rising event (the sampled value is greater than or equal to the rising threshold) can only be generated if the last event that occurred was a falling event (whose value is less than or equal to the falling threshold), as Fig. 5 shows. The analogy holds for falling events. However, in our opinion, this mechanism should consider another parameter. The hysteresis mechanism is effective in avoiding bursts of events when sampled values oscillate around the rising threshold, for instance [Fig. 5(a)]. If rising crossings happen sparsely in time and no falling events are generated in the meantime [Fig. 5(b)], we consider that such sparse rising events should be registered. The mechanism is not flexible enough for such situations. So the time between events of the same type, either in seconds or in sampling cycles, is also considered by the data analyzer module of the CINEMA Alert System. Once this length of time has expired, an event is generated regardless of its type.

Both the falling and rising thresholds are computed by the system as the average of the previous sampled values that lie in the sampling window plus a factor times the standard deviation:

RT = μ + (α × σ)
FT = μ − (β × σ)

where
RT = rising threshold
FT = falling threshold
α = rising factor
β = falling factor
σ = standard deviation of the values in the sampling window
μ = mean of the values in the sampling window
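A minimal sketch of this computation follows, assuming (as the symbols suggest) that the falling threshold subtracts its factor times the standard deviation while the rising threshold adds it. The factor values below are arbitrary tuning choices, not taken from the paper:

```python
from statistics import mean, pstdev

def thresholds(window, rising_factor=2.0, falling_factor=2.0):
    """Compute RT and FT as described in the text: the mean of the values
    in the sampling window plus/minus a factor times the standard
    deviation. Factor values here are illustrative tuning knobs."""
    mu = mean(window)
    sigma = pstdev(window)  # population standard deviation of the window
    rt = mu + rising_factor * sigma
    ft = mu - falling_factor * sigma
    return rt, ft

rt, ft = thresholds([10, 12, 11, 13, 10, 12])
print(rt > 13, ft < 10)   # True True: new samples inside (FT, RT) raise no event
```

Because the window slides, the thresholds are recomputed on every cycle, which is what keeps the baseline up to date without operator intervention.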

In fact, there are two of these factors, one associated with the rising and one with the falling threshold. These factors can be used for fine-tuning: network operators can broaden or narrow the number of network offenders to care about by setting these thresholds. Greater accuracy in changing environments, with a permanently up-to-date network status, is a major advantage of automatic threshold computation. Furthermore, the operations staff is freed from the periodic task of searching for typical days to obtain a network baseline.

The stream of events coming from the data analyzer is the input of the event processor module. Based on a set of rules, this module will correlate events with other events, with information in the configuration database, with opened trouble tickets, and with any available information base to generate alerts when appropriate. Alerts are then dispatched to the next module and displayed on the management workstation screen. There are rules designed to output alerts only when an action must be taken. Examples of these rules are shown in Section IV-B. The event processor is the most important module in the CINEMA Alert System, since it is the one that decides whether or not an event should be notified to the operator. Therefore, it is the part of the system in which intelligence is placed. At the present time, this module performs some checks on events, searching for similar events already registered (same IP number, same object instance, same event type), but it can grow towards an expert system in future work.

The last Alert System module in the sequence that begins with data collection and ends with on-screen information to network operators is the Alert Display Controller. This module's role is managing the output of alerts: inserting, deleting, and getting user acknowledgments from the network operator's workstation interface.
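The duplicate check attributed to the event processor can be sketched as follows. The field names and addresses are illustrative only:

```python
# Sketch of the similarity check described above: suppress an event when
# one with the same IP, object instance, and event type is already
# registered. Field names and values are invented for illustration.

def is_duplicate(event, registered):
    key = (event["ip"], event["object_instance"], event["type"])
    return any((e["ip"], e["object_instance"], e["type"]) == key
               for e in registered)

seen = [{"ip": "192.0.2.1", "object_instance": "ifInErrors.1",
         "type": "rising"}]
new = {"ip": "192.0.2.1", "object_instance": "ifInErrors.1",
       "type": "rising"}
print(is_duplicate(new, seen))   # True: no new alert is generated
```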
If the sampler, data analyzer, and event processor modules are kept remote, the alert system can be used, with some adjustments, as a remote monitor, as previously mentioned. In this way, parts of the network can be monitored locally, the major advantage being lower overall network management traffic. The network components, sampling intervals, and the object instances to be monitored, as in the case of the Manager-to-Manager MIB [2] and the



RMON MIB's Alarm Group, could be set by way of control table entries sent to the remote monitor through the network.
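For concreteness, one such control table row might look like the sketch below, loosely patterned after the alarm table of RFC 1271. The field set is simplified and the values are invented:

```python
# Illustrative control table entry a central station might push to a
# remote monitor; names follow the spirit of RFC 1271's alarmTable but
# are simplified, and all values are invented examples.

CONTROL_ENTRY = {
    "alarmIndex": 1,
    "alarmInterval": 300,             # seconds between samples
    "alarmVariable": "ifInErrors.2",  # object instance to watch
    "alarmRisingThreshold": 50,
    "alarmFallingThreshold": 5,
}
print(CONTROL_ENTRY["alarmVariable"])   # ifInErrors.2
```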

B. The Ticket Structure


The main purpose of the UFRGS network concerning trouble tracking is to minimize MTTR while supporting cooperative integrated network management. As illustrated in Fig. 2, there is a single ticket database. This means that Service Control staff, domain managers, and Help Desk people perform updates, i.e., open, make notes on, and close tickets, in a single space. With such integration, either a global view of network faults (inside and outside domains) or a specific view (one's own scope of faults) is accessible to all NOC members. In this sense, a single trouble ticket system allows NOC members to provide support to each other and to be aware of the current status of the whole network, regardless of the building or campus in which they work.

There are some special fields on which the trouble ticket system relies to support cooperative management. At every ticket open, as Fig. 6 illustrates, there is a field for a brief description of what is being reported by the complainant, apart from the fields that identify a ticket (like Ticket Id and Ticket Status). Completion of this field is menu-driven, easing later queries and ticket identification since it provides predefined options. With those predefined options, users can also refer to tickets by the sort of problem with which they are associated. The menu-driven completion fields at every ticket open are there to get as much information as possible from the user calling the operations center, and to help operators follow the standard procedures every Help Desk system should support for listening to the user at the very first contact. Menu-driven completion fields are a feature of many of the Help Desk systems observed, where there are completion options for most of the fields. They are important for avoiding major typing mistakes whenever automatic fill-in is not possible.
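The menu-driven brief description can be sketched as below. The option list, field names, and validation are invented for illustration; the paper does not specify the schema:

```python
# Illustrative ticket-open routine with a menu-driven brief description:
# only predefined problem types are accepted, so later queries by problem
# sort are reliable. All names and options here are invented.

PROBLEM_TYPES = [
    "high response time", "no connectivity", "mail delivery failure",
]

def open_ticket(ticket_id, problem_type, contact, domain, opened_by):
    if problem_type not in PROBLEM_TYPES:   # menu-driven: no free typing
        raise ValueError("unknown problem type")
    return {
        "id": ticket_id, "status": "open",
        "brief": problem_type, "contact": contact, "domain": domain,
        "opened_by": opened_by,
        "responsible": opened_by,  # default: the operator who opened it
        "notes": [],
    }

t = open_ticket(1, "high response time", "user@inf", "inf.ufrgs", "helpdesk1")
print(t["responsible"])   # helpdesk1
```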
Fig. 6. Information for every ticket open.

Since failure recovery usually depends on the intervention of many people in a decentralized operations center, there should be a staff member responsible for supervising the recovery process. This is the one who should contact technicians if a problem solution has been delayed, and call maintenance over and over if needed, no matter whether he or she is currently the one working directly on the specific ticket. Therefore, a field Ticket Responsible is used for this purpose. The ticket responsible can be assumed to be the operator who opened the ticket. Whenever ticket responsibility changes, the newly responsible staff member gets a notification.

The staff member working on a ticket is not always the one responsible for it. After the ticket is opened, a visit to the user might be required, for instance. In this sense, there should be a way to assign a problem to someone who is in a better position to help. The field Ticket Dispatched To is used for this purpose at every ticket note. As soon as the ticket is dispatched to someone, a notification is made to this staff member.

At every note made on the ticket, an important event has occurred. If the operator or technician working on the ticket has not finished his task yet, there are things left to be done. In this case, a special field can be used to remind the staff member of the action to be taken at the right time. The field Next Action is used for that. A time and a text stating the step to be taken in the near future are associated with the action. The trouble ticket escalation process presented in the next section keeps track of this field for all opened tickets, and notifies the NOC staff members currently working on them when it is asked to.

As stated before, another concern of the CINEMA Trouble Ticket System is controlling the quality of equipment and vendor support. Two of the fields commonly used in many ticket-tracking systems can serve this purpose: the ticket open and close times, which belong to the ticket record structure. The difference between the open and close times of a ticket is used to calculate the Mean-Time-to-Repair for a network component, and the time difference between the previous close and the next open time is used to calculate the Mean-Time-Between-Failures. Once MTTR and MTBF are available, reports can be generated correlating these data with manufacturer/vendor data fetched from the NOC's configuration database.
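The two indicators can be sketched directly from those fields. A toy computation with invented times (in hours) for one component:

```python
# Sketch of the MTTR/MTBF computation described above. Tickets are
# (open_time, close_time) pairs for a single network component, in
# chronological order; all sample values are invented.

def mttr(tickets):
    """Mean time to repair: average of (close - open) across tickets."""
    return sum(close - open_ for open_, close in tickets) / len(tickets)

def mtbf(tickets):
    """Mean time between failures: average gap between a ticket's close
    and the next ticket's open for the same component."""
    gaps = [tickets[i + 1][0] - tickets[i][1] for i in range(len(tickets) - 1)]
    return sum(gaps) / len(gaps)

history = [(0, 4), (100, 102), (250, 260)]   # hours since some epoch
print(mttr(history))   # (4 + 2 + 10) / 3
print(mtbf(history))   # ((100 - 4) + (250 - 102)) / 2
```

Grouping the ticket history by component and vendor before applying these functions yields exactly the sort of vendor-quality report the text describes.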

C. A Case Study

For a better comprehension of the CINEMA Trouble Ticket System's functionality, let us take a look at a typical case in which the system would be used. An end user from the Biotechnology Center calls the Help Desk located on campus Vale. He has noticed a sudden increase in response time on his local network. The Help Desk operator listens to the user and takes note of the user's information: name, e-mail, phone extension, and the machine where he is currently logged in. Based on the latter, the ticketing system automatically fetches configuration data, such as domain and IP address. Then the operator follows a standard procedure, exchanging as much information as possible with the user and trying to solve the problem as a first-level one.

Since none of the first tests proves helpful in diagnosing the fault, the operator decides to open a ticket. Some fields are automatically filled in: the responsibility for the ticket is given to the operator opening it, and the field Dispatched To is set to the campus technician in charge of the Biotechnology Center domain. As soon as the ticket is saved, the technician is notified by e-mail, and when his shift starts, the ticket is picked up. Investigation begins by collecting data in the user's domain. Using monitoring tools, the technician observes excessively high collision rates on network interfaces around the biotechnology domain. Since he is not acquainted with this type of situation, the technician issues a note on the ticket with a help request to NOC engineers and experts. Some of the replies advise a gradual deactivation of workstations. While deactivating one workstation at a time, the technician finds a probably faulty interface, since the collision rate stops increasing. This faulty interface is sent to Service Control, from where it will be sent for repair. A new note on the ticket is issued explaining the isolation procedure. The Next Action field is set to 24 hours later, when the technician is supposed to test the faulty workstation with a spare network card. Once the interface is replaced and the workstation reactivated, everything seems to be okay. A new note on the ticket is issued, explaining the new domain status and dispatching it to the former operator working on the ticket. Only after the complainant's acknowledgment can the ticket be closed. A close code identifying a class of problems related to network interfaces is set, and a reference to the replaced network interface's manufacturer is made. Whenever new problems related to collisions appear, this ticket's solution procedure can be consulted and used as a guideline for troubleshooting.

Fig. 7. A client/server approach for trouble tracking.

D. System Modules Organization

The CINEMA Trouble Ticket System is based on the client/server model, as shown in Fig. 7. The ticket server is composed of five main modules: a console, the watchdog, the kernel, and the database and notification interfaces. Its clients are any other applications that happen to join the CINEMA environment, such as the alert system itself. It is important to note that the Trouble Ticket User Interface is also a client in the ticket system's approach, since it is supposed to run on workstations all around the university.

Due to its very nature, an NOC is there to keep track of problems. The bigger the network, the more problems the operations center is supposed to handle simultaneously. So recovering from failures as soon as possible is very important to the NOC and its image. The watchdog is the module responsible for helping network operators not to slow down and for keeping productivity in reasonable terms whenever time is a critical factor. It schedules all Next Action ticket field values and notifies whoever is supposed to be notified at the proper time. The watchdog module is also responsible for warning the trouble ticket system administrator (or somebody else, if specified) when there are tickets neglected for a certain number of hours or days, and when there are tickets opened for more than a certain number of hours or days. In the latter case, tickets optionally have their priorities increased. If a ticket

dispatched to a technician, for instance, remains forgotten for more than one day (because there have been no more notes on it since then), the Service Control supervisor should call him to check what is happening. The supervisor should also be notified about tickets opened for a considerably long time (from the NOC's point of view) when the workload is to be evaluated and new staff could possibly be hired.

There are many parameters to be set which are reflected in the services provided by the problem-tracking system. Furthermore, as in the case of other database systems, there is a need for somebody to administer the application. That is what the Console module is for. Security should be one of the system's concerns. Using the console, the administrator registers all the people who will be able to open, make a note on, or close a ticket: Help Desk and Service Control operators, the Service Control supervisor, technicians, domain managers, and eventual NOC collaborators. Further details on each of these classes of operations staff are also set using the console, such as associating technicians with domains. As mentioned above, some important parameters either affect global service or should be kept homogeneous for all ticket system clients. Therefore, they must be handled centrally, and so they are set using the console module. The number of priorities supported is an example. The console can also be used to specify the escalation time for each priority, i.e., the period after which the administrator is notified by the watchdog module about neglected tickets.

The trouble ticket server's main module is the Kernel. That is the module which carries out all the services requested by CINEMA environment applications. All open- and close-transaction-related requests coming from the network are serviced by the kernel based on the parameter setup made at configuration time. The database and notification interfaces provide support to the kernel services. Their main purpose is shielding the kernel from the details of the technology used for information storage and delivery. The database interface implements the functions required for a specific system, either a conventional relational or a distributed database system, commercial or not. From the kernel's point of view, it does not matter which database system is in use.
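The two watchdog checks can be sketched as below, with invented thresholds (the paper leaves the actual periods configurable via the console):

```python
# Sketch of the watchdog checks described above: flag tickets with no
# recent notes ("neglected") and tickets open past an escalation period,
# optionally bumping their priority. All times and thresholds are
# invented; times are hours since some epoch.

def watchdog(tickets, now, neglect_after=24, escalate_after=72):
    """tickets: dicts with 'id', 'opened_at', 'last_note_at', 'priority'."""
    neglected, escalated = [], []
    for t in tickets:
        if now - t["last_note_at"] > neglect_after:
            neglected.append(t["id"])
        if now - t["opened_at"] > escalate_after:
            t["priority"] += 1          # optional priority increase
            escalated.append(t["id"])
    return neglected, escalated

tickets = [
    {"id": 1, "opened_at": 0, "last_note_at": 80, "priority": 2},
    {"id": 2, "opened_at": 70, "last_note_at": 71, "priority": 1},
]
print(watchdog(tickets, now=100))   # ([2], [1])
```

In the real system the flagged ticket ids would be turned into notifications to the responsible staff member or the supervisor rather than returned.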



This is also the case for the notification subsystem. In the presence of limitations or problems, this module must choose which is the channel to be used. For instance, if a technician is not reachable by e-mail, the notification module either switches automatically to fax (and sends messages to the site closest to the technician) or sends messages to somebody else who is supposed to find the technician. IV. IMPLEMENTATION ASPECTS The prototypes of the CINEMA environment tools presented in this paper are being developed under a UNIX platform. The CINEMA alert and ticket applications use SunNet Manager package API (Sun Microsystems Inc .) as a development platform. A configuration database is set using this packages console, and is accessed by both CINEMA applications. This console is also used by NOC operators for eventual data and event requests, and when network topology is to be consulted. SunNet Managers API is used by the alert system to collect data around the network before information analysis is performed. The APIs model based on the remote procedure call provides a comfortable environment for application designers. Besides the regular data for monitoring (object instances, monitored entities, polling intervals, etc.), the alert system provides a callback function to the API. This function is run every time polled data, a trap, or a polling error arrives. The model is comfortable because the application needs neither to do the actual polling nor keep track of intervals between pollings. All that is required is the function to handle data coming in at the intervals provided at the start. The alert systems sampler module is the one that supplies the monitoring data to SunNet Manager, and the data analyzer module is exactly the callback function provided. For every object instance coming to the callback function, the sampled value and its associated arrival timestamp are stored in a local disk file. 
Before being stored, however, the incoming value is compared by the workstation running the alert system (and therefore running the callback function) against the thresholds for that specific object instance, and an event is generated if needed. Incoming traps become events automatically, without any processing by the data analyzer module.

A major limitation of the current version of SunNet Manager is that its runtime configuration database cannot be shared among NOC operators. Due to this fact, the database must be replicated across different sites. The natural implication of this approach is that any change in the network configuration (new network nodes, link deactivation, router operating system updates, etc.) must be reflected in the various database copies. At the present time, the NOC Service Control coordinates this update process. In the case of any change, the local domain manager must inform the Service Control, which is charged with informing all other domain managers and Help Desks about the new network status. If, in the future, a new version of the commercial package supports a distributed configuration database, the domain managers themselves will update their domains' local information.

The SunNet Manager package provides an interesting platform for application development, but has some points that could be improved upon. Its API allows applications to have partial control of what is displayed on SunNet's console. Applications can create icons for monitored entities (hosts, routers, bridges, workstations, links, etc.), connect them in logical views of the network, and search for them later. This feature provides an alternative way to check the network topology. Unfortunately, it is not possible to change the status of objects in the console; that is, if the customized user application detects a critical situation in a managed entity, there is no way to make the entity's icon blink. Icons blink only when data are requested from SunNet's console.

Some of the package's drawbacks can be addressed by the customized applications. The management information base available in the managed entities provides raw information through object instances. One interesting way of adding value to the obtainable information is creating expressions from object instances. Considering RFC 1213 [5], if the operator reads the percentage of input errors on a given interface with the expression ifInErrors / (ifInUcastPkts + ifInNUcastPkts), it will probably be more meaningful than just polling the number of input errors, i.e., ifInErrors. Expressions are not supported by SunNet Manager, but they are supported by the CINEMA Alert System. Another interesting feature not supported by this package but supported by the Alert System is automatic threshold computation. Without it, the operations center needs to periodically recompute thresholds to reflect the current network status.
CINEMA's tools use a windowing graphical user interface to gain flexibility (Fig. 8). The cut and paste feature can be used, for example, to store into a ticket field a mail message reporting problems. An alert can be dragged from the Alert System window and dropped onto the ticketing window to open a ticket whenever needed. To take advantage of those attractive features, the prototypes are being developed using the XView programming interface (Sun Microsystems Inc.).

Fig. 8. Trouble ticket system user interface.

A. Application Programming Interfaces

Care has been taken in designing an Application Programming Interface (API) for the trouble ticket system (Table I). The ticket's API has four classes of primitives, allowing other CINEMA applications to retrieve and update the trouble tickets' data. The programming interface is connection-oriented and mapped to UNIX socket system calls. When the client wants to retrieve tickets, for instance, it should establish a session with the ticket server using CineTT_Establish(), and use CineTT_GetFirstTkt() and CineTT_GetNextTkt() to obtain the information required. When ready, the client uses CineTT_Terminate() to release the session.

TABLE I
CINEMA TROUBLE TICKET API FOR CLIENTS

  Class          Primitive              Description
  Ticket         CineTT_DeleteTkt()     deletes a given ticket from the DB
  Ticket Read    CineTT_GetFirstTkt()   fetches the first ticket of a given query
  Ticket Read    CineTT_GetNextTkt()    fetches the next ticket in the current query
  Error Support  CineTT_Error()         provides a string given an error code

A connection-oriented model was chosen to provide reliable communication between the server and clients located all around the university. Furthermore, this model can be used to bootstrap parameters on the client side that should be kept homogeneous among all clients. That is the mechanism used, for instance, to inform the ticket system's user interfaces around the university of the priority levels with which the system currently works, and of the escalation levels assigned to each priority level.

To service the clients using the API above, the ticket server does not require many features from the database system used to store all ticket information. All a database system needs to offer the ticket server is a C language programming interface with basic database primitives that support updating, indexing, retrieving, and excluding records. Reliability mechanisms, such as atomic transactions and the use of checkpoints, are also desirable. POSTGRES [9] is the database system currently used. It is public domain software that fulfills these requirements and has some interesting extra features. The most relevant POSTGRES feature from the CINEMA Trouble Ticket System's point of view is the possibility of customization. The user can implement query language operators in C and bind them to POSTGRES. We expect to use this feature to implement a full-record scan, that is, a search for records based on a given regular expression compared to every field of every ticket. It means, for example, that ticket system users can search for all tickets that refer to a word similar to "modem" in any of the ticket fields.

Table II shows the primitives provided for the Alert System's Event Processor module development. Although small in number, the programming interface has its flexibility increased by the APIs provided by other CINEMA applications, like the trouble ticket system. Recall that this is the module where filtering is performed.

TABLE II
CINEMA ALERT SYSTEM'S EVENT PROCESSOR PROGRAMMING INTERFACE

  Class   Primitive               Description
  Input   CineSA_GetNextEvent()   reads a new event
  Output  CineSA_PutAlert()       outputs an alert to be displayed
  Output  CineSA_AckAlert()       acknowledges an old alert

B. Filtering

The incoming events read with the input primitive in Table II are submitted to a set of rules for alert generation. These rules are built based on the practical experience of the university's engineers. The filtering process may result in discarding some incoming events, increasing counters for others, and generating alerts based on just one event or on a set of events that really demands it. It should be noted that it is up to the Event Processor module designers to create their own strategies to handle the events. This means that all other modules are stable, while the event processor module is the only one that keeps being customized progressively. Currently, this module keeps an event history table to perform simple event correlation for alert generation. This correlation and general conditions are set on an if...then...else basis. At the present time, for instance, one of the Event Processor's concerns is related to connectivity problems that arise when a microwave link which connects one of the campuses to the rest of the university's network goes down. When there is an event indicating that the link went down, all later events related to hosts and network components located on the other side of the link are ignored. The semantic tree is illustrated by Fig. 9, and is described by the rule

IF   (problem is connectivity) AND
     (problem is connectivity with other side) AND
     (microwave link is down) AND
     (alert not generated yet)
THEN
     CineSA_PutAlert("microwave link is down!")

Fig. 9. A semantic tree for connectivity-related failures.

Another concern is multiple resets and power outages. Multiple resets indicate bad behavior in routers, for instance, and constant power outages can damage the hardware of the equipment. With the help of counters, the event processor module can keep track of such repetitive situations. The rule used in the system is based on the time since the last initialization of the monitored entity (or UpTime), and on a counter that keeps track of how many times the UpTime indicator is less than 10 min. The rule looks like

IF   (UpTime is less than 10 minutes) AND
     (UpTime counter is greater than 3)
THEN
     CineSA_PutAlert("Check domain for power outage or equipment for multiple resetting")

Given the rule examples above, the event processor is clearly suitable for the use of expert systems techniques, although it is not yet developed with this purpose in mind. At the present stage of the project, there is a strong relationship between the user-chosen object instances informed to the sampler module and the rules in the event processor module. If the user adds or deletes objects to be monitored at runtime, some rules can become obsolete. We believe there are rules that are not affected by the inclusion or exclusion of objects. Connectivity handling is an example, since those rules handle the absence of a reply from any monitored entity rather than polling a specific object instance. However, there are other event-handling rules that are bound to object instances polled periodically by the system. For this reason, alert systems should be flexible enough to allow such rules to be read by the system at start-up, together with the objects to be monitored. This idea arose after the first tests of the CINEMA Alert System, and should be incorporated into the system in its next version.

One of the concerns any alert system should have is reliability. In the case of UFRGS, power failures are unfortunately common. To avoid missing critical data in such situations, the system also keeps track of events and alerts through the use of log files. At start-up, the system looks for the presence of such files on disk. If they are available, the events in the log file are the first ones to be read by the Event Processor module, and the first alerts to be displayed in the user interface are the ones in the associated log file. Some of the alerts can become tickets with the help of the CINEMA Trouble Ticket System API. The information associated with those alerts is permanently kept in the ticket database with regular reliability mechanisms such as atomic transactions, checkpoints, etc.

V. CONCLUDING REMARKS

For a decentralized operations approach, this work proposes integrated tools with which network operators and experts can cooperate. The Alert System analyzes the network after collecting data, and alerts are generated when needed with the help of a set of rules. The so-called Trouble Ticket System stores the problem-solving process, aiming to support cooperation in the resolution of pending problems, maintain a base of experience in failure recovery, and control vendors' products and service quality. Although designed for a TCP/IP environment, those tools use techniques suitable to other environments and platforms.

Before the CINEMA tools were available, many problems remained hidden in the university's computer network, since there was no handling for the traps sent by major network components (routers and servers). The traffic pattern was completely unknown, and there was no way to anticipate failures. User calls were the most common way for the informal operations group to become aware of problems. Since the prototypes became operational, proactive network management can be done because critical situations can be anticipated. This is especially true when it comes to performance management. Performance degradation can increase progressively with the expansion of the network, and now there are tools to provide real numbers to be studied and considered in any plans for expansion.

The CINEMA project does not stop with these tools. Future work in the project includes the development of an expert remote monitor to detect and isolate faults, and to handle tickets automatically.

REFERENCES
[1] M. Antonellini and L. Sebastiani, "Error rates: A convenient technique for triggering fault management procedures," in Proc. IFIP TC6/WG6.6 Symp. Integrated Network Management, Boston, MA, May 1989, pp. 353-363.
[2] J. Case et al., "Manager-to-manager management information base," SNMP Research, Inc., Request for Comments 1451, Apr. 1993.
[3] D. S. Johnson, "NOC internal integrated trouble ticket system functional specification wishlist," Merit Network, Inc., Request for Comments 1297, Jan. 1992.
[4] D. Long, "NEARnet trouble ticket system," BBN Systems and Technology, Cambridge, MA, 1991 (available via anonymous FTP from nic.near.net).
[5] K. McCloghrie et al., "Management information base for network management of TCP/IP-based internets: MIB-II," Request for Comments 1213, Mar. 1991.
[6] K. R. Meyer and D. S. Johnson, "Experience in network management: The Merit network operations center," in Proc. II Int. Symp. Integrated Network Management, I. Krishnan and W. Zimmer, Eds. North-Holland, Apr. 1991, pp. 301-311.
[7] D. C.-H. Sng, "Network monitoring and fault detection on the University of Illinois at Urbana-Champaign computer network," Rep. UIUCDCS-R-90-1595, Univ. Illinois at Urbana-Champaign, Apr. 1990.
[8] J. W. Stewart and J. K. Scoggin, Help Desk Management System User's Guide, Delmarva Power, Inform. Syst. Group, Network Operations, Newark, DE (available via anonymous FTP from ftp.delmarva.com).
[9] M. Stonebraker, POSTGRES Reference Manual, Univ. California, Berkeley, May 1990.
[10] K. Terplan, Communication Networks Management. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[11] G. Tjaden et al., "Integrated network management for real-time operations," IEEE Network, vol. 5, pp. 10-15, Mar. 1991.
[12] S. Waldbusser, "Remote network monitoring management information base," Carnegie Mellon Univ., Request for Comments 1271, Nov. 1991.

Ewerton L. Madruga received the B.Sc. and M.Sc. degrees in computer sciences from UFRGS in 1990 and 1994, respectively. He is currently working as an Assistant Professor at the University of Caxias do Sul, Brazil, and his areas of interest are computer networks, network management, and distributed systems.

Liane M. R. Tarouco received the B.Sc. degree in physics and the M.Sc. degree in computer sciences, both from UFRGS, in 1970 and 1976, respectively, and the Ph.D. degree in electrical engineering from Poli-USP/Brazil in 1990. She is an Associate Professor at the Institute of Informatics, UFRGS. Her areas of interest are network management, expert systems, and information processing.
