
Distributed Control System Reliability

Emmanuel G. Diaz & Iliya A. Dormishev

I. ABSTRACT In this paper, the importance of a reliable distributed control system is discussed. Focus is placed on reliability as a measure of how survivable a system is in terms of security, redundancy, robustness, and resilience. These principles are applied to a testbed emulator for application purposes in the development of a teaching tool on distributed control system reliability.

II. INTRODUCTION Society's dependence on computer technology has increased tremendously in the past twenty years. These technological advances have led to the use of Distributed Control Systems (DCS) in controlling and automating many necessary processes. Examples include air traffic control, electrical power distribution, and water treatment and distribution plants. Society's confidence in these systems rests on their reliability. An example of a DCS failure is the recent power failure that affected about 50 million people in eastern parts of the United States and Canada in August of 2003 [1]. This incident shows how important DCS reliability is to the society that depends on it.




In Barry C. Ezell's thesis on vulnerabilities of a Supervisory Control And Data Acquisition (SCADA) system [2], it is said that a reliable DCS depends on its survivability. Survivability is a measure of how a system will function and recover after failure and depends on the system's 1.) Security, 2.) Redundancy, 3.) Robustness, and 4.) Resilience. These principles directly involve the hardware and software of the system, and are the focus of this study. A simple water distribution emulator, SCADAVille, was used to develop a teaching tool for DCS and SCADA systems with a focus on software-oriented solutions. Each measure of survivability is defined by comparing its implementation on the emulator to that of a real-life system, showing the impact of these principles on DCS reliability.

Figure 1: SCADAVille Testbed (labeled components: river (water source), river pump (motor), reservoir (pressure sensor), reservoir pump (motor), village tank (pressure sensor), hill pump (motor), hill tank (pressure sensor), and two manual loads)

Figure 2: Programmable Logic Controller (PLC) Training Unit

SCADAVille (Figure 1) is an emulator of a basic three-zone water distribution system constructed by professors and students from the David Crawford School of Engineering at Norwich University. Each zone includes one of three water tanks: Hill, Village, and Reservoir. Each tank has a manual valve to replicate water usage and is connected to a water pump.

This DCS is controlled by four Allen-Bradley Programmable Logic Controller (PLC) system units (Figure 2). Each unit contains a PLC5 processor and the following Input/Output modules: Analog, DC, and AC. The units also include a thumbwheel, an LED display, a VU-Meter, lamps, and switches. To program and monitor the PLCs, the ladder logic software RSLogix was used. Three of the PLCs connect directly to electronic cards that drive a water pump and receive data from the system, such as water level. Each of these PLCs acts as a Remote Terminal Unit (RTU) and controls one zone. All three RTUs are networked together and controlled by a fourth PLC, the Master Terminal Unit (MTU), creating a SCADA system. Each RTU reads its tank's water level using a pressure transducer attached to the tank, and turns the pump on or off depending on a command from the MTU. The MTU acquires data from all three RTUs and sends commands to turn a pump on or off depending on water levels.

The system is monitored through a Human Machine Interface (HMI) programmed using LabVIEW (Figure 3), from which an operator controls the entire system. For each zone, the operator chooses an Upper Trip Point (UTP) and a Lower Trip Point (LTP). The LTP is the limit at which the pump turns on, meaning that a tank's water level is too low. The UTP is the limit at which the pump is turned off because the water level is reaching the tank's capacity. The operator can also manually turn any pump on or off, and receives alarms corresponding to any errors that occur.
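The trip-point behavior described above can be sketched in a few lines. This is only an illustration in Python, not the ladder logic that runs on the PLCs; the function name `pump_command` and the use of percent-of-capacity for the level are assumptions made for the example.

```python
def pump_command(level_pct: float, pump_on: bool, ltp: float, utp: float) -> bool:
    """Return the next pump state for one zone.

    level_pct, ltp and utp are assumed to be percentages of tank capacity.
    """
    if level_pct <= ltp:
        # Level is too low: turn the pump on.
        return True
    if level_pct >= utp:
        # Level is approaching capacity: turn the pump off.
        return False
    # Between the two trip points the pump keeps its current state.
    return pump_on
```

Holding the current state between the two trip points keeps the pump from switching on and off rapidly when the level hovers near a single threshold.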

Figure 3: Human Machine Interface (HMI) for SCADAVille System

III. SECURITY Security deals with prevention, detection, and defense from internal and external attacks. If nothing prevents a hacker from breaking into SCADA networks, anyone with enough knowledge of computers can disrupt communications, change program parameters, and eventually cause the DCS to malfunction. This principle appears to be the easiest to understand but is at the same time the hardest to implement, largely due to vulnerable software. Security is hindered by the inability to anticipate the creativity of those hacking the system. Therefore, it is important to design a system that provides protection against unauthorized access to operator displays, control operations, database modification, and access to applications and critical functions.

The use of SCADAVille allows for a better explanation of how security is a part of a survivable system. This system's security relies on its closed network rather than on its ability to deal with attacks; with no outside connection, unauthorized access through outside hacking is not possible. According to the US Department of Energy, one way to improve cyber security is to disconnect any unnecessary connections to the SCADA network [4]. In industry applications, SCADA systems use telephone lines, radio, or different types of networks to transfer information. The system can be accessed through the network or an existing Internet connection. This in turn makes it accessible and open to attacks that can result in Denial of Service (DoS) or information tampering.

If SCADAVille were changed so that it resembled a real system more closely, implementation of security would need to begin with the prevention of cyber attacks. To simulate this type of attack, a separate program was used to send large amounts of data packets to the HMI, flooding it with phony traffic. As a result, communications between the HMI and MTU were disabled by a DoS. To secure the system against this type of attack, a preventive measure was created to inspect the data coming into the HMI. A data filter examines the incoming data and accepts only packets matching the previously set port number and IP address of the MTU, disregarding the rest. This demonstrated an implementation of security in the form of prevention and detection.

Another way to add security to the system is to create a defense that acts once an attack occurs. In the testbed, if an attacker is able to penetrate the system undetected and change data values, the program code is designed to handle unreasonable data. For example, entering negative numbers for the UTP and LTP could result in physical pump damage and service interruption. The code defends the system by detecting that these values are not within a reasonable range, rejecting them, and instead using a set of default parameters, keeping service interruption and physical damage to a minimum.

In addition to security in communications, multiple layers of security need to exist so that if an attacker makes it through one, others will contain the attack and make it more difficult to gain access to sensitive areas. These layers can take the form of alarms, data encryption, and sector-specific privileges for employees. The addition of passwords and administrator privileges allows the chief operator access to the complete system, while a technician is limited to the areas that need servicing. This protects the system not only from external attacks, but also from internal ones.
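As an illustration of the data filter described in this section, the following Python sketch accepts only packets whose source address and port match a configured MTU endpoint. The `Packet` type, the function names, and the address `192.168.1.10:5000` are hypothetical; the paper does not give the actual filter code or network settings.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    payload: bytes


# Hypothetical MTU endpoint; the real address and port are not given in the paper.
MTU_ENDPOINT: Tuple[str, int] = ("192.168.1.10", 5000)


def accept(packet: Packet, trusted: Tuple[str, int] = MTU_ENDPOINT) -> bool:
    """Accept a packet only if its source IP and port match the trusted endpoint."""
    return (packet.src_ip, packet.src_port) == trusted


def filter_packets(packets: Iterable[Packet],
                   trusted: Tuple[str, int] = MTU_ENDPOINT) -> List[Packet]:
    """Keep packets from the trusted endpoint and disregard the rest."""
    return [p for p in packets if accept(p, trusted)]
```

A filter like this checks only the claimed source of each packet, so it is best seen as one of the several layers of security the section calls for rather than as a complete defense.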

Internal threats can be carried out deliberately by a disgruntled employee or arise by accident. They include social engineering, email viruses, and unintentional security breaches. The software and hardware protecting a system must be updated constantly to keep up with the rapid development of technology.

IV. REDUNDANCY Redundancy in a system refers to the ability of certain components of a system to assume functions of failed components without adversely affecting the performance of the system itself [1]. A vulnerability assessment has to be made in order to find out where a system can be brought down by a single point of failure [3], that is, a single component whose failure results in total loss of service. This can be applied to the water distribution testbed so that its application and importance are seen in a real-life system.

In terms of hardware, a system needs backup components that take over if the main ones fail. An assessment of SCADAVille concluded that it does not have any secondary or backup components, which leaves room for a chain reaction of failures and makes the system unreliable. The only redundancy implemented is a set of overflow valves that allow the water tanks to drain automatically when they are full and water is still being fed into them. Many problems can occur in a system that lacks redundancy. For example, if any one of the three pumps fails, one or more zones are left without replenishing water. If extra pumps are added, they can take over while the main pump is being serviced, allowing the system to continue uninterrupted. The same applies to the tanks, water lines, and drain valves. If a valve fails, stuck either open or closed, another valve is needed to stop the flow of water or to route the water through a different path, ensuring water delivery.

In order to apply the theory of redundancy to this system, different safeguards are added to the original design. The immediate solution is to include backup components that take over once failure occurs. The system's layout also needs to be considered, since a single event, such as a fire, can take out both primary and secondary components.
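A first pass at the kind of assessment described above can be sketched as a check over a component inventory. The function name, the inventory format, and the unit counts below are assumptions for illustration; a real assessment would also consider shared failure causes such as the fire mentioned above.

```python
from typing import Dict, List


def single_points_of_failure(units_per_function: Dict[str, int]) -> List[str]:
    """List every function that is served by fewer than two interchangeable units."""
    return sorted(name for name, units in units_per_function.items() if units < 2)


# Hypothetical inventory reflecting the finding that the testbed has no backups.
scadaville = {"river pump": 1, "reservoir pump": 1, "hill pump": 1, "MTU": 1}

# The same inventory after adding one backup hill pump.
with_backup_pump = {**scadaville, "hill pump": 2}
```

Calling `single_points_of_failure(scadaville)` flags every entry, while the second inventory no longer flags the hill pump.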
Figure 4: Redundant Communications Layout for a Distributed Control System (DCS) (diagram labels: System Operator, HMI, Server Redundancy, Network and Cable Redundancy, PLC Redundancy, with Primary and Secondary units)

The system can also improve its survivability by improving its communications and controls. For communication media, improving redundancy means adding extra lines and means of communication between RTUs, MTUs, servers, and any monitoring HMI attached to the system. This may take the form of multiple dedicated servers and wiring to these servers. It also includes different ways of communicating. If an RTU is communicating over a local area network (LAN) and the network goes down, the MTU can reconnect using a modem over telephone lines. An example of this is the simple layout in Figure 4.
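The LAN-to-modem fallback can be sketched as an ordered list of channels that are tried in turn. This is a minimal Python illustration; the function `send_with_failover` and the idea of representing each channel as a callable that returns True on success are assumptions, not part of the testbed.

```python
from typing import Callable, Sequence, Tuple

# A channel is a name plus a send function that returns True on success.
Channel = Tuple[str, Callable[[bytes], bool]]


def send_with_failover(message: bytes, channels: Sequence[Channel]) -> str:
    """Try each channel in priority order and return the name of the one that worked."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except OSError:
            # Treat a network-level error as a failed channel and try the next one.
            continue
    raise ConnectionError("all communication channels failed")
```

Listing the LAN first and the modem second reproduces the behavior described above: the modem is used only when the LAN send fails.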

V. ROBUSTNESS The third principle of survivability is a system's robustness. It refers to the degree of insensitivity of a system design to errors [2]. In other words, it asks the question: how will the system components function during abnormal conditions? A robust system keeps all components, both software and hardware, functioning well enough to keep the system running after an attack or failure.

In order to apply this principle to SCADAVille, focus was placed on its software, specifically the software controlling the RTUs and MTU. Code needed to be implemented that could function under abnormal conditions. In terms of data transfer, if the MTU stops communicating with the HMI, information can no longer be exchanged between the operator and the system. This will not necessarily result in a system shutdown, but the operator loses control of the entire system. Code in each of the terminal units can be changed so that a failure in communication is first detected and a new set of rules is then given to the system. The purpose of this set of rules is to keep the system stable in the presence of failure and to add fault tolerance. These rules make sure that water is kept above a certain level in each tank, using predetermined values, and that the pumps are on only when this condition is not met. A check for network availability is also made so that normal system operations are restored as soon as possible.

Suppose that instead of communications failing, an attacker succeeded in changing system parameters such as the trip points of each zone, creating a water resource problem. This can be prevented by checking that each LTP is less than or equal to its UTP. If the check fails, a set of rules triggers alarms, and if an operator does not take appropriate action, the system's robustness takes over by not accepting the new data or by reverting to default numbers.
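The trip-point check can be illustrated with a short Python sketch. The function name, the 0 to 100 percent range, and the default values are assumptions for the example; the paper does not state the actual defaults used for this check.

```python
from typing import Tuple

# Illustrative defaults only; the actual predetermined values are not given.
DEFAULT_LTP = 50.0
DEFAULT_UTP = 90.0


def validate_trip_points(
    ltp: float,
    utp: float,
    fallback: Tuple[float, float] = (DEFAULT_LTP, DEFAULT_UTP),
) -> Tuple[float, float, bool]:
    """Return (ltp, utp, alarm).

    New values are accepted only if both lie between 0 and 100 percent and the
    LTP does not exceed the UTP. Otherwise the fallback pair is returned and
    the alarm flag is set.
    """
    in_range = 0.0 <= ltp <= 100.0 and 0.0 <= utp <= 100.0
    if in_range and ltp <= utp:
        return ltp, utp, False
    return fallback[0], fallback[1], True
```

Passing the last accepted pair as `fallback` corresponds to not accepting the new data, while leaving the argument at its default corresponds to reverting to default numbers.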
Now that the system will remain operational after an attack or failure, the last principle deals with how the system behaves after the incident.

VI. RESILIENCE Resilience is the ability of a system to operate close to its intended design, technically and institutionally, over a short time after the attack, such that the losses are within manageable limits [2]. This concept helps make the system reliable by acting after an attack is made on the system. Resilience applies not only to the system itself, but also to the individuals who operate it.

Basic examples of resilience are virus protection programs and protocols for technicians and operators to follow in case of failures due to an attack. Systems that have connections to the Internet run the risk of being infected by a computer virus. Virus protection software constantly searches through system files for a virus and, if one is found, destroys it before it corrupts any data. In addition, if an attack is successful, a plan of action has to be in place so that system operators and technicians can deal with the situation and bring the system back online as soon as possible. These two examples can be used to understand the idea behind resilience when applied to a DCS.

In SCADAVille, resilience is limited to the code used to program the RTUs and MTU. The code has numerous checks between each unit and the supervisor, which create a set of alarms that are displayed through the HMI. From here, operators attempt to correct the errors and return the system to its normal operational mode. If communication between the units is lost, the entire system enters a special mode in which it keeps running, but very inefficiently. This is a stand-alone mode where each zone becomes an isolated control system. Each pump is controlled directly by its RTU rather than by signals exchanged with the MTU. This mode has set parameters, which operate the pump so that the tank level is always above 50% of capacity but less than 90%.

To build further on the resilience of the system, a safety-instrumented controller can be added that takes over the RTU's duty for a limited amount of time if the RTU fails. This controller would constantly check the status of the RTU so that it can switch on immediately after failure of the RTU.
This controller will allow the system to remain online while repairs are made to the primary unit, thus contributing to the overall survivability of the system.
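One common way to realize the status check performed by such a backup controller is a heartbeat watchdog. The Python sketch below is an assumption about how this could be done, not a description of the testbed; the class name `RtuWatchdog` and the timeout value used in the example are hypothetical.

```python
from typing import Optional


class RtuWatchdog:
    """Decide when a backup controller should take over from the primary RTU."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record that the primary RTU reported healthy at time `now` (seconds)."""
        self.last_heartbeat = now

    def backup_active(self, now: float) -> bool:
        """Return True when the backup should be controlling the zone."""
        if self.last_heartbeat is None:
            # No heartbeat seen yet: assume the primary is unavailable.
            return True
        return (now - self.last_heartbeat) > self.timeout_s
```

Because the decision is recomputed from the most recent heartbeat, control returns to the primary RTU as soon as it starts reporting again after repair.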

VII. CONCLUSION The automation industry's rapid movement toward computer controls has increased the importance of reliability in distributed control systems. This reliability is directly tied to four concepts that measure a system's survivability. Security is a system's first line of defense, where different layers work on detecting, defending against, and deterring attacks. A reliable system depends on components getting the job done; therefore, redundancy ensures that these components have backups so that interruption of service is delayed long enough for repairs to take place. If an attack is successful, a robust system will continue functioning uninterrupted at close to normal operation. Resilience connects everything together by having a system return to normal operation, or as close to it as possible, after a successful attack occurs. By focusing resources on these four areas of a system, the world can continue to benefit from the automation and control applications that involve reliable distributed control systems.

References:
1.) Final Report on the August 14, 2003 Blackout in the United States and Canada: Causes and Recommendations. April 2004. (10 March 2005).
2.) Ezell, Barry. Risks of Cyber Attack to Supervisory Control and Data Acquisition for Water Supply. May 1998 SCADA-thesis.html (11 March 2005).
3.) Jesperson, Kathy. Water Systems Should Polish Security Plans. On Tap Magazine Winter 2002. /OT/WI02/WaterSys_SecPlans.html (13 April 2005).
4.) 21 Steps to Improve Cyber Security of SCADA Networks. (28 March 2005).
5.) Boyer, Stuart A. SCADA: Supervisory Control and Data Acquisition. North Carolina: Research Triangle, 1993.
6.) Trends in SCADA for Automated Water Systems. November 2001. (1 April 2005).
7.) GE Fanuc Automation. August 24, 2001. products/brochures/servredn.pdf (15 April 2005).