
CAUSES OF FAILURE IN IT TELECOMMUNICATIONS NETWORKS

Robert Hudyma and Deborah I. Fels
Ryerson University, 350 Victoria St., Toronto, Canada
rhudyma@ryerson.ca

Traditional techniques and models used to determine the availability and failure rates of telecommunications networks are based on classic failure models such as Mean-time between Failure and Mean-time between Service Outage predictors. Network failures occur for many different reasons and in many different forms, yet these classic models assume that failures are caused only by hardware components of the network. With the widespread deployment of Internet technologies, other factors that cause or contribute to failure in a telecommunications network must be explored. Two failure modes that extend existing published failure models, failures from Denial of Service attacks and failures due to catastrophic events, are identified and defined, along with an initial outline of a generalized prediction model based on Dynamic Systems Theory.

Introduction
For over thirty years the US Department of Defense specification MIL-HDBK-217F has been a standard measure for estimating the intrinsic reliability of electronic equipment and systems. It is based on an analysis of the average time, in hours, for electronic components to fail, called Mean-time between Failure (MTBF) [1]. Several similar standards, such as Bellcore TR-322 [2], as well as many modifications and derivatives, have been used to predict the behavior of telecommunications equipment that is currently in production [2, 3]. Even though MTBF determination is well established, the application of this reliability predictor in telecommunications network design is "frequently misunderstood and misused" [4, p. 36]. Research has shown that overly optimistic failure predictions result from misunderstandings and misapplications of MTBF assessments [12]. Despite the misunderstanding and misuse of these predictors, the telecommunications industry is still very much focused on their use. A search of technical documentation at leading telecommunications hardware manufacturers (Cisco and Juniper Networks) shows extensive documentation for failure prediction based on the MTBF and Mean-time between Service Outage (MTBSO) standards and little on other causes of network failures. This collective mindset extends throughout the telecommunications industry, where one finds an abundance of information on the use of MTBF predictors and little information on other categories of network failures.
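As a rough illustration of how an MTBF figure translates into availability, and why MTBF alone is easily misread, the following sketch converts an MTBF and an assumed mean time to repair (MTTR) into steady-state availability and expected annual downtime. The numbers are illustrative assumptions, not figures from the cited standards.

```python
# Hypothetical illustration: converting an MTBF and an assumed MTTR into
# steady-state availability and expected downtime per year.

HOURS_PER_YEAR = 8760.0

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_minutes(avail: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60.0

if __name__ == "__main__":
    mtbf = 100_000.0   # hours; within the range often quoted for telecom hardware
    mttr = 4.0         # hours; an assumed repair time, not taken from the standards
    a = availability(mtbf, mttr)
    print(f"Availability : {a:.6f}")                                   # ~0.999960
    print(f"Downtime/year: {annual_downtime_minutes(a):.1f} minutes")  # ~21 minutes
```

On these assumed figures, a component with a 100,000-hour MTBF still implies roughly 21 minutes of downtime per year, which falls short of the five-nines target discussed below.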

Kyas (2001) has identified five categories of errors that can lead to general system failure in data processing systems, considerably beyond the MTBF failure predictors [5]:
1. Operator Error
2. Mass Storage Problems
3. Computer Hardware Problems
4. Software Problems
5. Network Problems
This paper reviews the five categories proposed by Kyas (2001) to determine whether additional categories are required and whether there is an opportunity to express a general failure predictor model.

Categories of Network Failures


Category 1: Hardware Problems
Telecommunications equipment suppliers have focused on Kyas's (2001) category of hardware problems as the main predictor of network failure rates. Approximately 25% of all failures occur as a result of hardware problems such as computer failures [5]. To improve overall hardware reliability in telecommunications products, vendors build redundancy into their product offerings. A network designer can select and deploy equipment with a wide range of redundancy options, ranging from no redundancy to the complete duplication (or more) of equipment and links. When applied in this narrow context, the US Military and Bellcore standards are useful predictors. Today, it is common for individual hardware components of telecommunications equipment to have MTBFs ranging from 80,000 hours to several hundred thousand hours [4]. In the actual deployment of networks there are variations in more than just the hardware components that are selected. These variations include the quality of equipment, the quality of network planning and design, the complexity of the implementation, and the interaction and interoperability of components. Many networks are exceptionally complex systems - it is amazing that they show any stability at all! Mission-critical networks are designed to have "five-nines" availability (i.e. 99.999% availability) and are required to meet that performance criterion based on an MTBF assessment. However, there are four other important categories of failure identified by Kyas (2001) that contribute to the other 75% of network failures not identified through an MTBF analysis of hardware problems. These other contributors to network failure rates (or, correspondingly, availability) must be considered to accurately assess and predict network availability. MTBF analysis is not an appropriate measure for three of these categories.

Category 2: Operator Error
Operator Error (OE) is defined by Kyas (2001) as those failures caused directly by human actions. Operator error is further subdivided into intentional or unintentional mistakes, and into errors that do or do not cause consequential damage. Kyas (2001) suggests that OE is responsible for over 5% of all system failures. This figure typically varies from enterprise to enterprise based on the level of training and other factors such as corporate culture and procedures. This type of error is useful in examining possible types of network system failures. An operator error that affects network reliability can arise from people's interaction with networking equipment, physical cables and connectors, as well as from events on other IT devices that result from user actions. Other IT devices such as database servers and e-mail servers can produce broadcast storms and duplication of network addresses due to the actions of individuals operating the various devices within the network.

Category 3: Mass Storage Devices
This category covers failures associated with mass storage devices. Failures of these devices have been studied by various manufacturers as well as by users of the devices. Although high-performance hard drives can achieve exceptionally high MTBF values of 10^6 hours (almost 114 years), many organizations that employ banks of hard drives experience higher failure rates simply because of the large quantity of drives in use. Furthermore, environmental factors such as temperature variation and physical handling or mishandling, combined with the frequency of certain drive operations such as non-stop seek operations, will affect both the MTBF and its statistical distribution. A failure analysis can factor these considerations into the reliability planning of a network. Although the failure of these devices is not by itself considered a network failure, there has been rapid growth in the deployment of Storage Area Networks (SANs), in which large arrays of mass storage devices are directly connected to a network through high-capacity channels. SANs do indeed qualify as network devices since they are network-centric. From a computer hardware perspective, traditional MTBF evaluations are appropriate for these devices.

Category 4: Software Problems
Today, enterprise networks connect large numbers of servers that provide functionality to large numbers of users through a very large number of software applications. Widely distributed systems are common in enterprises that are geographically dispersed, and the network provides all connectivity between the various computer platforms and clients. In systems of such complexity, even with careful planning, monitoring and assessment, it is difficult to predict the service demands on the network. Failures can arise from insufficient capacity and excessive delays during peak demand, as well as from catastrophic failures arising from the loss of a vital component or resource. Network software failures can be caused by faulty device drivers, subtle differences in protocol implementation and handling, and operating system faults and anomalies. According to Kyas (2001), software problems account for approximately the same number of failures as hardware problems (25%) and are important to any meaningful reliability analysis.

Category 5: Network Problems
Hardware and software problems that are directly related to the network are included in this category. These can account for more than one-third of IT failures [5]. To better understand the distribution and nature of these types of failures it is useful to discuss them in the context of the OSI model. Figure 1 shows the distribution of errors among the layers of the OSI model in Local Area Networks. Causes of failures within the lower layers of the model are often defective NIC cards, defective cables and connections, failures in interface cards in bridges, routers and switches, beacon failures (Token Ring networks), checksum errors, and packet size errors. As Ethernet technologies have improved over time, there has been a decline in failure rates within the lower layers of the OSI model, but there has been an increase in failure rates in the Application Layer as software complexity continues to explode. Many of the errors and failures described here are often localized (usually to one computer or user) and not catastrophic in nature.
Localized failures are very different from those defined by the US Military and Bellcore models, which allow local failures to occur without being considered a device failure. In understanding the contribution of localized failure to network reliability, it is important to consider the scale and size of failures that are caused by individual network components. For example, the failure of a NIC card will not likely result in a single point of failure for the enterprise network. However, a core router failure without appropriate redundancy and switchover can incapacitate an entire network.
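As a hedged illustration of that point about single points of failure (availability figures are assumed, and failures are treated as independent with instantaneous switchover), the sketch below compares a path that depends on a single core router with one that has a redundant pair:

```python
# Hypothetical illustration: a single core router as a single point of failure
# versus a redundant pair. Availabilities are assumed, and failures are treated
# as independent with perfect switchover, which is optimistic in practice.

def redundant(avail: float, copies: int) -> float:
    """Availability of N identical components in parallel (any one suffices)."""
    return 1.0 - (1.0 - avail) ** copies

def series(*avails: float) -> float:
    """Availability of components in series (all must be up)."""
    result = 1.0
    for a in avails:
        result *= a
    return result

access_switch = 0.9999      # assumed availability of a non-redundant access switch
core_router   = 0.9995      # assumed availability of a single core router

single = series(access_switch, core_router)
dual   = series(access_switch, redundant(core_router, 2))
print(f"single core router: {single:.6f}")   # ~0.999400
print(f"redundant pair    : {dual:.6f}")     # ~0.999900
```

With these assumptions the redundant pair removes the core router as the dominant contributor, and overall availability becomes limited by the non-redundant access switch instead.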

Application Layer: 20%
Presentation Layer: 5%
Session Layer: 5%
Transport Layer: 15%
Network Layer: 25%
Data Link Layer: 10%
Physical Layer: 20%

Figure 1. Frequency of LAN errors by OSI model layer

Additional failures not categorized by Kyas (2001)
Although Kyas's (2001) five categories account for a large number of network failures, the following two additional categories merit consideration and discussion:
1. Failures due to Denial of Service attacks (worms, viruses, Trojan horses and other malicious software).
2. Failures from disasters such as fire, flood, earthquakes, outages and the like.

Category 6: Denial of Service Attack
Denial of Service attacks have been a major source of network failures since 2000 [9]. Today they occur several times a year, resulting in service disruption worldwide, and the frequency of this type of network failure is increasing at an alarming pace. Only private, tightly controlled networks without Internet access are immune from this form of attack, achieved by deploying "air gaps" in the network: a physical gap with no connectivity, across which data is transferred manually between nodes. This approach is not practical for the vast majority of networks today, which rely on Internet connectivity. The Code Red worm and a more recent variation, the Slammer worm, for example, disrupted millions of computers by unleashing well-coordinated Distributed Denial of Service attacks, resulting in a significant loss of corporate revenues worldwide [10, 11]. The increasing frequency of this threat and its considerable impact on network disruption (and corporate revenues) mean that the Denial of Service attack category must be included in any valid failure analysis model of an Internet-connected enterprise network.
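To illustrate how quickly such attacks can saturate a population of vulnerable hosts, the sketch below runs a toy logistic epidemic model. The initial infection level and contact rate are assumptions chosen for demonstration, not measurements of Code Red, Slammer, or any other worm.

```python
# Toy logistic epidemic model of worm spread among vulnerable hosts.
# The initial infected fraction and contact rate are assumptions for
# demonstration only, not measured parameters of any real worm.
import math

def infected_fraction(t_minutes: float, rate_per_min: float, i0: float) -> float:
    """Fraction of vulnerable hosts infected after t minutes (logistic growth)."""
    return i0 / (i0 + (1.0 - i0) * math.exp(-rate_per_min * t_minutes))

i0 = 1e-5      # assumed fraction of vulnerable hosts infected at t = 0
rate = 1.5     # assumed effective contacts per infected host per minute

for t in (0, 5, 10, 15, 20):
    print(f"t = {t:2d} min  infected fraction ~ {infected_fraction(t, rate, i0):.3f}")
# With these assumed parameters nearly all vulnerable hosts are infected
# within about 15 minutes, the kind of speed discussed below.
```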

It is possible that even more insidious malicious code will be unleashed to wreak havoc worldwide. Researchers have recently postulated how a worm, dubbed the Warhol worm, could disrupt the entire Internet within fifteen minutes [6]. The Slammer worm, for example, brought Internet service to a halt in India, disabled a million machines in Korea, disabled automated tellers at the Bank of America, and affected universities and a major Canadian bank, all in the course of a few days in 2003. Worms such as Code Red and Slammer were probably authored and unleashed by an individual or a small number of individuals. An even more worrisome threat exists if the malevolent code is part of an information warfare attack. It is well documented that countries such as China have an active development program for waging cyberwar [7]. This form of attack can cripple an information technology based society (not just one infected network). The threats are quite real and, when unleashed, create a dysfunctional network until the malevolent code is effectively eradicated. Predicting the percentage of network failures caused by this type of error is difficult because it is such a recent phenomenon with random occurrence. However, the potential impact of this failure is enormous and widespread, and cannot be discounted.

Category 7: Disaster Scenarios
The final category of failure considered in this paper is that of disaster scenarios, which arise from a wide range of circumstances, many of them environmental and some human-made. Environmental disasters include floods, earthquakes, hurricanes, long-term power outages, tornadoes and fires. Human-made disasters can include theft, vandalism, arson, war and acts of terrorism. In each of these disaster scenarios many more causes can be listed. In some cases there is a history of regional occurrence that can be useful in predicting such an event; in many other cases no previous knowledge or useful means of prediction is available. Disaster planning has only recently become a high IT priority as the collective mindset of the world has focused on dealing with the threat of widespread terrorism.

Developing a Comprehensive Failure Analysis Methodology


A number of categories have been introduced that identify the causes and types of failures a telecommunications network can experience. In some cases the probability and nature of the failure can be estimated; in many others any estimate would be guesswork and therefore inaccurate. The question, and the corresponding challenge, is how to proceed. Clearly an assessment can be made in each of the seven categories, and quantitative and speculative predictions can be made. These can be prioritized and used as input to the risk assessment of the telecommunications infrastructure. This approach provides a methodology by which a corporation can assess and respond to a wide range of failures in the network. However, an alternative and perhaps less speculative approach, Dynamic Systems Theory, may be possible. Dynamic Systems Theory, first proposed by Thom [8], describes catastrophes as bifurcations of different equilibria or fixed-point attractors. It has been used to characterize a large number of natural and synthetic phenomena ranging from insect populations to the capsizing of ships at sea. Certain types of failures in telecommunications systems can clearly be described by this theory; network failures such as route flapping are natural candidates for this kind of model. The open research question and challenge is to apply Dynamic Systems Theory to all of the different categories of network failures and compare the results with the existing models that use MTBF and MTBSO as predictors.
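As a hedged, purely illustrative example of the kind of behavior Dynamic Systems Theory captures (a toy model, not the authors' proposed predictor), the sketch below computes the equilibria of the one-dimensional system dx/dt = r + x - x^3. As the control parameter r is varied smoothly past a threshold, two equilibria collide and vanish, forcing an abrupt jump to the remaining state; this fold bifurcation is the simplest analogue of a sudden transition such as a link flipping between stable and flapping regimes.

```python
# Illustrative fold (saddle-node) bifurcation: equilibria of dx/dt = r + x - x^3.
# Purely a toy model of how a smoothly varied parameter can produce an abrupt
# jump in system state; it is not calibrated to any real network.
import numpy as np

def equilibria(r: float) -> list[float]:
    """Real roots of r + x - x^3 = 0, i.e. the fixed points of the system."""
    roots = np.roots([-1.0, 0.0, 1.0, r])          # coefficients of -x^3 + x + r
    return sorted(float(x.real) for x in roots if abs(x.imag) < 1e-9)

for r in (-1.0, -0.2, 0.0, 0.2, 0.5, 1.0):
    eq = [round(x, 3) for x in equilibria(r)]
    print(f"r = {r:+.1f}  equilibria: {eq}")
# For |r| below roughly 0.385 there are three equilibria; past that threshold
# two of them merge and disappear, so the state must jump to the remaining branch.
```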

Conclusion
This paper has presented seven categories of possible failures in telecommunications infrastructures, providing a much broader perspective than the general industry practice of MTBF analysis. Furthermore, Dynamic Systems Theory is presented as an approach to consider in describing all categories of failure. Ongoing and future research will use the tools provided by Dynamic Systems Theory to determine the extent to which the failure categories can be considered simultaneously.

References
1. U.S. Department of Defense (1991). MIL-HDBK-217F, Reliability Prediction of Electronic Equipment. Washington, D.C.: U.S. Department of Defense.
2. Bellcore (2001). TR-322, Issue 6, Reliability Prediction Procedure for Electronic Equipment. Morristown, New Jersey: Telcordia Technologies.
3. Ma, B. & Buzuayene, M. (2003). MIL-HDBK-217 vs. HALT/HASS. Retrieved October 2003 from www.evaluationengineering.com/archive/articles/1100reliability.htm.
4. Dooley, K. (2002). Designing Large-Scale LANs. Sebastopol, California: O'Reilly & Associates.
5. Kyas, O. (2001). Network Troubleshooting. Palo Alto, California: Agilent Technologies.
6. Staniford, S., Paxson, V., & Weaver, N. (2003). How to Own the Internet in Your Spare Time. Retrieved October 2003 from http://www.icir.org/vern/papers/cdcusenix-sec02/
7. Thomas, T.L. (2003). Like Adding Wings to the Tiger: Chinese Information War Theory and Practice. Retrieved October 2003 from http://call.army.mil/fmso/fmsopubs/issues/chinaiw.htm
8. Thom, R. (1975). Structural Stability and Morphogenesis: An Outline of a General Theory of Models (First English Edition). Reading, Massachusetts: W.A. Benjamin, Inc.
9. Bosworth, S., & Kabay, M.E. (2002). Computer Security Handbook. Toronto: Wiley.
10. Cherry, S.M. (2003). Internet Slammed Again. IEEE Spectrum, Volume 40, Issue 3, Page 59.
11. Schreiner, K. (2002). New Viruses Up the Stakes on Old Tricks. IEEE Internet Computing, Volume 6, Issue 4, Pages 9-10.
12. Kuhn, D.R. (1997). Sources of Failure in the Switched Telephone Network. IEEE Computer, Volume 30, Issue 4, Pages 31-36.

Fujitsu Develops Industry's First System-Failure Management Technology for Cloud Computing Era
- Delivers high-reliability, non-stop system services; automatically detects and resolves system failures

Kawasaki, Japan, February 23, 2010

Fujitsu Laboratories Ltd. today announced the development of technology that will enable the company to implement the Trusted-Service Platform it has been advocating for cloud computing services, in view of the shift toward the era of cloud computing. As an industry first, Fujitsu has developed technologies that can detect system failures before they happen, by improving the ability to analyze cloud system data and gather information, narrowing down the causes of failures, and automatically resolving them. Cloud systems play an important role in supporting various societal infrastructure systems and must be able to continuously deliver services without interruption. Even in the event that a failure does occur, services must not be interrupted. Through Fujitsu's new technology, it is possible to address cloud system failures before they occur. Furthermore, because failures can be automatically resolved, the technology reduces the workload of administrators and delivers cloud services that users can utilize with confidence.

Background
Cloud computing is a delivery model whereby remotely located IT resources, such as servers, storage, networks, middleware and business applications, are provided as services over the Internet. Users have the ability to use the functions they need, in whatever amounts they need, and only when they need them. In addition to its use as a platform for further enhancing work efficiency and productivity, cloud computing is also employed as a system for supporting various societal infrastructure systems, like those used in entertainment or lifestyle-related applications. In order to support the creation of a human-centric networked society, in which information and communication technologies (ICT) are employed with an emphasis on people, there is a need for cloud systems to continue delivering secure and stable services non-stop. Traditionally, many companies have addressed system failures immediately after their occurrence. However, because companies cannot afford downtime for cloud systems that play an important role in supporting infrastructure systems, a different approach is required. In addition, large-scale systems have thus far ensured the continuous operation of services through expensive, redundant configurations. In order to deliver high reliability and stability to cloud systems - which aim to operate economically - what is needed is technology that can predict and resolve failures before they emerge.

Technological Challenges
Cloud computing systems have the following characteristics:
1. Large scale: When companies take existing systems that operate independently and consolidate them into data centers and enterprise IT systems, the scale of the systems increases.
2. Complexity: When companies employ virtualization technologies and operate numerous services on the same physical server, system configurations and system dependency relationships become complex.
Given these characteristics, when a failure takes place in a cloud system it can affect various services, in addition to requiring a great deal of manpower and time to locate where the failure has occurred.

Newly-Developed Technology
In order to provide highly reliable and stable services via cloud computing, Fujitsu Laboratories developed a technology that detects failures and averts them before they occur. Specifically, the technology monitors the system, predicts failures, narrows down their causes, and quickly resolves them. (Figure 1)

Figure 1: Previous vs. newly-developed method of detecting and handling failures

1. Detection of signs of failure (Prediction)
Fujitsu Laboratories has developed two technologies to detect signs of failures, depending on the type of failure.
(1) Detection of failures through the analysis of system messages: This technology focuses on specific patterns in messages that are generated just before failures occur and detects warning signs. By comparing the pattern of generated messages with messages from previous system failures, the technology can pick up on signs of failure. By employing Bayesian learning methods to assign weights to example data from previously generated message patterns, the system can detect signs of failure with great accuracy. (Figure 2)

Figure 2: Failure detection through analysis of system messages
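The details of Fujitsu's Bayesian weighting are not public, so the sketch below is only a generic illustration of the idea: message patterns observed in a recent window are scored with a naive Bayes model whose probabilities, and the pattern names themselves, are invented example values that a real system would learn from labeled incident history.

```python
# Illustrative naive Bayes scoring of log-message patterns as failure precursors.
# Pattern names, probabilities and the prior below are invented for demonstration;
# this is a sketch of the general approach, not Fujitsu's implementation.
import math

# P(pattern | failure) and P(pattern | normal), estimated from labeled history.
p_given_failure = {"disk_retry": 0.60, "fan_warning": 0.30, "link_flap": 0.45}
p_given_normal  = {"disk_retry": 0.05, "fan_warning": 0.10, "link_flap": 0.08}
p_failure_prior = 0.01   # assumed prior probability of an imminent failure

def failure_score(observed_patterns: set[str]) -> float:
    """Posterior probability of imminent failure given the observed message patterns."""
    log_f = math.log(p_failure_prior)
    log_n = math.log(1.0 - p_failure_prior)
    for pattern in p_given_failure:
        seen = pattern in observed_patterns
        pf = p_given_failure[pattern] if seen else 1.0 - p_given_failure[pattern]
        pn = p_given_normal[pattern]  if seen else 1.0 - p_given_normal[pattern]
        log_f += math.log(pf)
        log_n += math.log(pn)
    return 1.0 / (1.0 + math.exp(log_n - log_f))

print(failure_score({"disk_retry", "link_flap"}))   # ~0.35, far above the 0.01 prior: alert
print(failure_score(set()))                          # ~0.002: treat as normal
```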

(2) Detection of potential failures that do not generate messages: When configuring equipment such as servers, human error can lead to the input of incorrect settings. In this kind of situation, the server will operate according to the settings and may not generate any error messages. An effective method for detecting failures in this instance is to gather and analyze data packets that travel across the networks that link servers and systems, and then analyze minor changes at the packet level, such as data loss, resent packets and transmission delays. In order to monitor the large-scale systems involved in cloud computing, Fujitsu Laboratories has developed a technology that is compatible with 10 Gbps high-speed communications and detects network and server system failures in real time.
2. Narrows down causes of failures
The technology scans through detected signs pointing toward system failure and makes inferences about the most likely areas that have generated these signs. Using the observed symptoms as points of origin, the technology employs network and system configuration information to trace the symptoms' causes. It then overlays the results of evaluations taken from multiple points of origin, generating inferences about the most likely causes based on the areas with the most overlap or without proper activity (an illustrative sketch of this overlap idea follows after item 3).
3. Resolves causes of failures
The system leverages past knowledge of how to deal with system failures, including system log information, and presents administrators with the most suitable methods for dealing with the determined causes. Because previous failures often recur, the system stores previous cases of system failures, and the procedures used to resolve them, in its knowledge base so that it can quickly determine a solution to resolve the cause of the failure.
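A hedged sketch of the cause-narrowing idea in step 2 (the dependency data and symptom names are invented for illustration; the actual Fujitsu inference method is not public): starting from each symptomatic component, collect everything it transitively depends on, then rank candidate causes by how many symptoms' traces overlap on them.

```python
# Illustrative root-cause ranking by overlaying traces from multiple symptoms
# on a dependency graph. The topology and symptom names are invented examples.
from collections import Counter

# component -> components it depends on (simplified dependency/topology data)
depends_on = {
    "web_vm_1": ["switch_a", "storage_1"],
    "web_vm_2": ["switch_a", "storage_1"],
    "db_vm_1":  ["switch_b", "storage_1"],
    "switch_a": ["core_router"],
}

def upstream(component: str) -> set[str]:
    """The component plus everything it transitively depends on (possible causes)."""
    seen, stack = set(), [component]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen | {component}

def rank_causes(symptomatic: list[str]) -> list[tuple[str, int]]:
    """Rank candidate causes by how many symptoms' upstream traces overlap on them."""
    votes = Counter()
    for comp in symptomatic:
        votes.update(upstream(comp))
    return votes.most_common()

# Two web VMs and one DB VM show symptoms: the shared storage array collects the
# most overlapping votes and is the best-supported candidate cause.
print(rank_causes(["web_vm_1", "web_vm_2", "db_vm_1"]))
```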

Results
With this new technology, Fujitsu is able to quickly address cloud system failures, allowing the delivery of high-reliability, continuous-operation cloud system services to its customers. In its own internal systems that employ the technology, Fujitsu has been able to detect instances of mistaken system settings before errors actually occurred. In addition, Fujitsu has been able to reduce the time required to resolve failures from an average of 15 minutes to approximately one minute.

Future Developments
Fujitsu plans to gradually deploy this technology in its On-Demand Virtual System Services and LCM services, on its Trusted-Service Platform.

http://www.fujitsu.com/global/news/pr/archives/month/2010/20100223-01.html

Fujitsu Unveils Software for Detecting Network and Cloud Environment Failures
Analyzes network communications response and data volumes, while supporting the building and operation of high-quality cloud environments
Tokyo, December 20, 2010

Fujitsu today announced the launch of a software package for managing network monitoring services, available only in Japan. Proactnes II SM V01 monitors and analyzes the communications response and data volumes (packets) of networks found inside large-scale datacenters that provide cloud services, as well as of the Internet, which connects users to cloud services. The software detects early signs of failure by pinpointing the areas in the cloud where these problems arise. The new software prevents system problems from occurring in a customer's private cloud, in public clouds offered by cloud service providers, and in existing ICT systems. At the same time, it allows customers to improve operating efficiency while enabling them to quickly restore system operations if a failure does arise.

Background
With the spread of cloud computing, datacenter ICT resources and customer ICT systems are becoming increasingly large and complex due to the deployment of virtualization and automation technologies. As a result, in the areas of both private cloud operations management and public clouds that deliver cloud services, companies have to deal with the growing burden of understanding the configuration of virtual systems and ensuring their stable operation. To address this challenge, Fujitsu is launching Proactnes II SM V01, a software package for managing services that monitor networks both inside and outside of datacenters.

Software Features
Proactnes II SM V01 employs the cloud computing system-failure detection technologies introduced by Fujitsu Laboratories Limited in February 2010. An industry first, these technologies can detect system failures before they happen, with improved ability to analyze cloud system data and gather information, thus narrowing down the causes of failures and resolving them successively. The technology is being employed at Fujitsu's Tatebayashi System Center, which provides customers with cloud services, to help monitor the stable operation of Fujitsu's services, thus ensuring a service level of 99.99%.
1. Uncovers failure areas in real time
By analyzing the response of communications across a network, the system pinpoints exactly where a failure has occurred, on both the datacenter's internal network and the external networks that connect users to cloud services. This enables administrators to quickly uncover communications problems, a cause of failures that is difficult to track down, thereby making it possible to restore system operations at an early stage when trouble occurs.
2. Detects signs of system trouble
Based on the correlation between the volume of data (packets) traveling across a network within a datacenter and the CPU usage assigned to each virtual server (virtual machine), the technology identifies virtual servers experiencing unusual behavior, allowing it to detect early signs of system trouble (an illustrative sketch of this correlation check follows after item 3).
3. Makes the virtual systems of cloud service users visible
In cloud computing, it is generally difficult to determine which physical servers a user's virtual server is operating on. However, this software package uses network device settings and other information to make the configuration of users' virtual systems visible. This eliminates the complexity involved in configuration management for system administrators, making it possible to quickly address system failures if they arise.
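A hedged sketch of the correlation check in feature 2 (all sample figures are invented; this is not Fujitsu's implementation): fit the historical relationship between each virtual server's packet volume and CPU usage, then flag servers whose latest measurement deviates strongly from it.

```python
# Illustrative anomaly check: correlate per-VM packet volume with CPU usage and
# flag VMs whose latest sample deviates from the fitted relationship.
# All measurements below are invented; this is a sketch, not Fujitsu's code.
import numpy as np

# Historical samples per VM: (packets per second, CPU utilisation %)
history = {
    "vm-01": [(1000, 12), (2000, 22), (3000, 31), (4000, 43)],
    "vm-02": [(1500, 18), (2500, 27), (3500, 38), (4500, 47)],
}
latest = {"vm-01": (3200, 34), "vm-02": (1600, 62)}   # vm-02: high CPU, little traffic

def is_anomalous(samples, observation, tolerance_pct=10.0) -> bool:
    """Fit CPU ~ a * packets + b from history and flag large residuals."""
    pkts = np.array([s[0] for s in samples], dtype=float)
    cpu = np.array([s[1] for s in samples], dtype=float)
    a, b = np.polyfit(pkts, cpu, 1)
    expected = a * observation[0] + b
    return abs(observation[1] - expected) > tolerance_pct

for vm, obs in latest.items():
    print(vm, "anomalous" if is_anomalous(history[vm], obs) else "normal")
# vm-02 is flagged: high CPU with low packet volume suggests trouble on that VM.
```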

Pricing and Availability


Software packages, pricing (excluding tax) and availability:
- Proactnes II SM Service Visualization Basic License V01: from JPY 300,000/month; available late January 2011
- Proactnes II SM Server Configuration Information Collection Basic License 1 V01: from JPY 150,000/month; available late January 2011
- Proactnes II SM Failure Detection Basic License 1 V01: from JPY 150,000/month; available late January 2011

Sales Target
25 billion yen by the end of fiscal 2013 (fiscal year ending March 31) in sales of Proactnes II series products, including related hardware, software and system development.

About Fujitsu
Fujitsu is a leading provider of ICT-based business solutions for the global marketplace. With approximately 170,000 employees supporting customers in 70 countries, Fujitsu combines a worldwide corps of systems and services experts with highly reliable computing and communications products and advanced microelectronics to deliver added value to customers. Headquartered in Tokyo, Fujitsu Limited (TSE:6702) reported consolidated revenues of 4.6 trillion yen (US$50 billion) for the fiscal year ended March 31, 2010. For more information, please see: www.fujitsu.com.

http://www.fujitsu.com/global/news/pr/archives/month/2010/20101220-01.html

As information and systems increasingly move into virtualized environments, whether they are cloud data centers run by managed service providers, internal private clouds, the wider Internet, or a combination of all three, the problem of being able to guarantee high availability and service levels becomes ever more tricky. Effective network management and monitoring tools that give adequate visibility and control of these increasingly complex environments are therefore vital. A single point of failure, or bottleneck, could affect the performance of any number of applications and services, degrading users' experience and, critically in the case of cloud service providers, which are contractually committed to delivering certain levels of availability, denting a business's bottom line. It's perhaps not surprising, then, that big vendors are at pains to persuade organizations that they have the tools for the job. Fujitsu's Proactnes II SM V01 software, announced last year and available in Japan since January this year, is described as an industry first. The technology can detect system failures before they happen (see Data Feed, below), and the software offers a notably improved ability to analyze cloud system data and gather information, thus narrowing down the causes of failures and resolving them successfully. Fujitsu is now using the software at its Tatebayashi System Center, which offers customers cloud services, and says as a result it is possible to ensure service levels of 99.99%. As CIOs weigh up the decision to move their IT into the cloud, such assurances from potential providers will be ever more key to their confidence in migrating mission-critical systems and services. But, as analysts point out, monitoring data supplied by cloud vendors is probably not enough. If CIOs truly want to know whether their SLAs are being met, they also need to maintain their own logs of what users are experiencing to ensure their provider is achieving what it claims.
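As a hedged illustration of that last point (the endpoint URL, log file and 99.99% target are assumptions for the example, not details of any vendor's service), a script like the following, run periodically from a scheduler, probes the service itself, keeps an independent log, and compares measured availability against the contracted level:

```python
# Illustrative client-side SLA check: probe a service, keep a local log, and
# compare measured availability with the contracted target. The URL, log file
# and 99.99% target are assumptions for the example, not any vendor's figures.
import csv
import time
import urllib.request

ENDPOINT = "https://example.com/health"   # hypothetical service health URL
TARGET_AVAILABILITY = 0.9999              # assumed contractual SLA
LOG_FILE = "sla_log.csv"

def probe(url: str, timeout_s: float = 5.0) -> tuple[bool, float]:
    """Return (success, latency in seconds) for one request to the service."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False
    return ok, time.time() - start

def record(ok: bool, latency: float) -> None:
    """Append one observation to the local log."""
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), int(ok), f"{latency:.3f}"])

def measured_availability() -> float:
    """Fraction of logged probes that succeeded."""
    with open(LOG_FILE, newline="") as f:
        rows = list(csv.reader(f))
    return sum(int(r[1]) for r in rows) / len(rows) if rows else 1.0

if __name__ == "__main__":
    ok, latency = probe(ENDPOINT)
    record(ok, latency)
    avail = measured_availability()
    verdict = "meets" if avail >= TARGET_AVAILABILITY else "misses"
    print(f"Measured availability {avail:.5f} {verdict} the {TARGET_AVAILABILITY:.2%} target")
```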

Data feed
By analyzing the response of communications across a network, the Fujitsu software locates where a failure has occurred, on both the data center's internal network and the external networks connecting users to the cloud. By correlating the volume of data packets traveling across a network with the CPU usage of each virtual machine, the system identifies unusual behavior at an early stage. Processes are simplified by using network device settings and other details to make the configuration of users' virtual systems visible.
