
NOUVELLES CONSULTANTS

Software and Product Safety


EG2401 Project 7 Final Report
Muhammad Ahad Muhammad Nadzri Hussain Ruben S/O Sukumar Farhan Bin Mohamad U084075R U096036E U084705E U090387B

3/10/2011

An analysis of the technical, security and ethical issues involved in the Ariane 501 Rocket and Patriot Missile Defence System failures that were caused by software malfunction.

EG2401 Project 7 Final Report Tutorial 308

1.1 - Ariane 501 case

1.1.1 Case overview

On June 4, 1996, the maiden flight of the Ariane 501 launcher ended in a catastrophic failure. About 40 s after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. The inquiry report attributed the accident to a complete loss of guidance and attitude information 37 s after the start of the main engine ignition sequence (30 s after lift-off).

1.1.2 Immediate Follow-Up Action

Immediately after the disaster, engineers from the Ariane 5 project teams of the Centre National d'Etudes Spatiales (CNES) and industry started to investigate the failure. In addition, the Director General of the European Space Agency (ESA) and the Chairman of CNES set up an independent Inquiry Board to examine its cause. The investigations covered the causes of the failure, the systems thought to be responsible, any failures of a similar nature in similar systems, and events that could possibly be linked to the accident.

1.2 - Factual Issues

1.2.1 General Failure Analysis

In this section we analyse which systems failed, the causes of those failures and the conceptual issues behind them. The flight has to be assessed thoroughly from the point of take-off. The sequence of events that caused the failure, as traced from the flight data records, is summarised in the table below.
System                           Event
Back-up SRI and active SRI       Failure of the back-up Inertial Reference System, followed by the failure of the active Inertial Reference System
Solid boosters                   Swivelling into the extreme position of the nozzles of the two solid boosters
Vulcain engine                   Failure caused the launcher to veer abruptly
Rupture of links                 Rupture of the links between the solid boosters and the core stage
Self-destruction of the launcher Due to the rupture of links

Table 1: Information extracted from (LIONS, 1996)

Considering this sequence of events, it is evident that the initial error that triggered the chain of failures lay in the flight control system, specifically in the SRI, which failed to function as intended.

1.2.2 Function of SRI

The primary function of the SRI (the inertial reference system) is to measure the attitude of the launcher and its movements in space. It has its own internal computer, which calculates angles and velocities based on information from a "strap-down" inertial platform with laser gyroscopes and accelerometers. The data from the SRI are transmitted to the On-Board Computer (OBC), which executes the flight program and controls the nozzles of the solid boosters and the Vulcain cryogenic engine via servo-valves and hydraulic actuators (LIONS, 1996). To ensure reliability, two SRIs work in parallel and there are also two OBCs; at any time only one unit is active and the other is on standby. If a failure is detected in the active unit, the system automatically switches to the other unit, provided that it is functioning properly.
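The active/standby arrangement described above can be sketched in a few lines of Python. This is an illustrative sketch only and not the flight software: the class `InertialUnit`, its `healthy` flag and the function `select_unit` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InertialUnit:
    """Hypothetical stand-in for one inertial reference unit (SRI)."""
    name: str
    healthy: bool = True


def select_unit(active: InertialUnit, standby: InertialUnit) -> Optional[InertialUnit]:
    """Return the unit whose data should be used, or None if neither works.

    The standby unit is used only when the active unit has failed and the
    standby unit itself is still functioning properly.
    """
    if active.healthy:
        return active
    if standby.healthy:
        return standby
    return None
```

Redundancy of this kind guards against independent faults in one unit. If both units run identical software and receive the same input, both can become unhealthy together, and `select_unit` then has nothing left to return.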

1.2.3 Why SRI Failed?

The SRI software designed for Ariane 4 (an earlier version of the launcher) was reused in the SRI of Ariane 5. The investigation found that a software error occurred in the active SRI and that the main control system could not switch to the other SRI because it had failed from the same software problem. The internal SRI software error occurred during execution of a data conversion from a 64-bit floating point value to a 16-bit signed integer value. The floating point number being converted had a value greater than what could be represented by a 16-bit signed integer, and this caused an operand error. The value in question was the result of an internal alignment function known as horizontal bias (BH), which is related to the horizontal velocity sensed by the platform. The error resulted in the shutdown of both SRIs.
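The conversion problem can be illustrated with a short Python sketch. It is not the original flight code, and the function names are hypothetical. Python integers do not overflow, so the 16-bit limits are checked explicitly to emulate the behaviour. The second function shows one possible guard, clamping the value to the representable range; this is an assumption chosen for illustration and not the fix adopted by the Ariane programme.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_strict(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit.

    This emulates an unprotected conversion whose failure stops processing.
    """
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values.

    The caller still receives a best-effort value instead of an exception.
    """
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

A value such as 1000.0 converts normally with either function. A value such as 40000.0 makes `to_int16_strict` raise, while `to_int16_saturating` returns 32767.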

1.3 - Conceptual Issues

1.3.1 Culture of the Ariane program

The ethics behind the software design are complex. One of the causes could be the culture of the Ariane program, which addressed only random failures, the kind that can be dealt with efficiently by back-up systems. The software was designed so that an SRI would shut down whenever an exception was raised, which was an unnecessary function. From an ethical standpoint, this type of failure should have been simulated before the flight. Creating such complicated software involves addressing the possible problems that failures can cause; according to the inquiry report, however, the design was biased towards mitigating random failures. The supplier of the SRI was only following the specification provided by the host company, which stipulated that in the event of any detected exception the processor was to be stopped. The exception that occurred, however, was due not to a random failure but to a design error. Even though the exception was detected, it was inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. After completing its investigation, the Inquiry Board favoured the opposite view: that software should be assumed to be faulty until the application of currently accepted best-practice methods can demonstrate that it is correct (Aviation Week and Space Technology, 1996).


1.3.2 No proper testing done

This is a grey area that needs to be addressed. Years of testing were done, but the emphasis was mainly on hardware failures and problems. Because software is so critical in space launchers, proper modelling of the software design is vital to prevent disasters. The extensive tests conducted during the Ariane 5 development program did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure. It is not possible to test the SRI as a "black box" in the flight environment without a completely realistic flight test, but it is still possible to do ground testing by feeding in simulated accelerometer signals in accordance with predicted flight parameters, while using a turntable to simulate launcher angular movements. Such a test could have been performed by the supplier or as part of the acceptance test. It would have exposed the failure, and the problem could then have been rectified (Aviation Week and Space Technology, 1996).

1.3.3 Adequate Software Review

From the inquiry report, it is evident that no proper software review was done prior to launch or to the hardware simulations. No stringent rules or restrictions were in place to ensure that the software was up to standard and met the requirements for further testing. In such space missions thousands of functions can be performed correctly and a single mistake can still be catastrophic for the mission. Mistakes can, however, be prevented by oversight, testing and independent analysis. A proper software review was much needed to avert such a failure.

1.3.4 Management and Organizational Factors

The coordination between the supplier of the SRI and the Ariane 501 team, and within the Ariane 501 team itself, has to be questioned. The Ariane 501 accident report is almost totally silent about organizational structure: it does not describe the allocation of responsibility and authority for safety, nor does it mention any organizational or management factors that may have influenced the accident. Proper procedure manuals were absent, and problems that surfaced might have been ignored by senior aviation specialists. The reason behind this lack of communication remains unclear, and it is hard either to prove or to dismiss entirely that such a problem existed. From the examination of the accident data, there were signs that the initial problems were not properly dealt with by the Ariane 501 team. A more transparent organization of the cooperation among the partners in the Ariane 5 programme could have averted the disaster. Excellent cooperation among team players, together with clear-cut authority and responsibility, is vital to achieve system coherence. In most disasters of similar magnitude, an inadequate transition from development to operations was among the key reasons cited for failure. Engineering management sometimes has a tendency to focus on development and to put less effort into planning the operational phase.

1.4 - Ethical Issues

The ethical issues in this case have to be examined carefully to identify mistakes, irregularities, weaknesses in management, development and operation problems, and problems of overall coordination. The issues will be examined with due regard to software failure and software ethics. The development of Ariane 5 cost around $7 billion and the loss from the failed flight was around US$370 million, so the financial loss was severe. The problems are therefore addressed below from an ethical point of view.

1.4.1 Utilitarianism

From a rough cost-benefit analysis, developing proper software would take around two years, involve many team members (which translates into man-hours) and cost in the order of billions of dollars. The benefits are long term rather than short term. Reputation is also at stake, so the cost-benefit analysis cannot be reduced to dollars alone. The engineers who designed the software reused the Ariane 4 software on Ariane 5 without evaluating all the scenarios that could result from doing so. Under the rule utilitarianism approach, certain good practices were broken, and this resulted in a bad software design.

1.4.2 Duty ethics

One might conclude that duty ethics is not of much importance here as there was no fatality. However, the software designers had a professional duty to fulfil: producing complex software for the engineers who were working on the hardware design. Since the inquiry report clearly stated that the software was not tested thoroughly enough to find potential problems, the software engineers fell seriously short of their duties. Apart from the software engineers, coordination among the key Ariane 501 team players was lacking, and everyone had a responsibility to work together to ensure that the launch was successful. The communication breakdown, together with the failure to exercise proper responsibilities, illustrates the importance of duty ethics. The duties of higher management also come into serious question. They should have initiated proper software reviews for Ariane 501, which could have detected the error at the start. Every engineer in the 501 program had the right and the duty to ensure a successful launch.

1.4.3 Virtue Ethics

Certain actions of the engineers and professionals involved do not reflect good character traits. Loyalty to the profession is part of the ethical code all engineers should follow. Loyalty to employers is also a character trait that successful engineers should have. Such qualities were absent among the Ariane 501 industry partners and engineers. The negligence and carelessness of the software design team were reflected in the post-accident analysis by experts. Such a negligent attitude has been extremely costly and has left a huge dent in the Ariane reputation.

1.4.4 Self Realization Ethics

Ethical egoism is also an issue here. The idea of long-term well-being, rather than a narrow, short-sighted pursuit of immediate success or pleasure, has to be examined carefully. Looking at the whole process from research to development, testing and launch, it can be concluded that a mind-set of seeking immediate success was present among top management. The careless or negligent attitude might have resulted from this trait. Such a rush to ensure success led to a catastrophic failure.

1.5 - Recommendations and current practices

Based on the above analysis and the inquiry report, the following recommendations would have helped to minimize the possibility of the Ariane 501 disaster.

The alignment function of the inertial reference system should be switched off as soon as lift-off occurs, and it should be ensured that no unnecessary software runs during flight. The system should be thoroughly tested using as much real data as possible, and flight simulations should be conducted before any mission. All sensors should remain engaged, and no sensor should be allowed to stop sending its best-effort data. Furthermore, specific software qualification reviews should be conducted. Industrial Architects, or whoever is in charge, should participate in these reviews and report on complete system testing together with equipment analysis.

Additionally, all flight software (including the embedded software) should be comprehensively reviewed:
o The assumptions (explicit and implicit) in the software code should be checked against the constraints on equipment usage.
o The range of values taken by variables in the software should be verified to be realistic.
o Potential problems in the on-board computer software should be proposed by the project team and then reviewed by a group of external experts.
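The review of variable ranges listed above could be supported by a simple automated check. The following Python sketch is a minimal illustration under stated assumptions: the function names, the variable names and the expected ranges in the example are all hypothetical and are not taken from the Ariane software.

```python
from typing import Dict, List, Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def fits_int16(expected_min: float, expected_max: float) -> bool:
    """True if every value in the expected range fits a 16-bit signed integer."""
    return INT16_MIN <= expected_min and expected_max <= INT16_MAX


def unsafe_conversions(expected_ranges: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of variables whose expected range exceeds 16 bits."""
    return sorted(
        name
        for name, (low, high) in expected_ranges.items()
        if not fits_int16(low, high)
    )


# Hypothetical ranges, made up for illustration only.
example_ranges = {
    "horizontal_bias": (-50000.0, 50000.0),
    "vertical_velocity": (-1000.0, 1000.0),
}
print(unsafe_conversions(example_ranges))
```

The value of such a check depends on the expected ranges being derived from the trajectory of the launcher actually being flown, not from an earlier vehicle.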

Also, whenever it is technically possible, exceptions to tasks should be confined and backup capabilities should be devised. More data should be provided to the telemetry when a component fails, so that it would be less essential to recover the equipment. The definition of critical components should be reconsidered, by taking failures of software origin into account, especially single point failures.

External parties should also be consulted when reviewing specifications, code and justification documents. Such parties should not be involved in the project and should consider the substance of the arguments rather than merely check that verifications have been carried out. The trajectory data should be included in specifications and test requirements. The test coverage of existing equipment should be reviewed and extended where necessary. The justification documents should be regarded as being as important as the code, and the techniques for keeping the code and its justifications consistent should be improved.

A team should be set up which would prepare the procedure for qualifying software, propose stricter rules for confirming such qualifications and ensure that the specification, verification and testing of the software are of a consistently high quality in the Ariane 5 programme. This should also include external experts.

Lastly, there should be a more transparent organization of the cooperation among the Ariane 5 programme partners. Engineers should collaborate very closely and there should be clearly defined roles of authority and responsibility to ensure system coherence.

1.6 - Conclusion

The Ariane 501 failure is a multi-faceted example of how software errors and a lack of adequate testing, maintenance and coordination can lead to major disasters, in this case a multi-million dollar loss.

There are several issues involved in the failure. The old specifications from Ariane 4 were not compatible with Ariane 5, since its flight path was considerably different. The testing performed was inadequate and did not take into account many of the factors involved. The compatibility between software and equipment was not tested as far as possible. The various parties involved in the project did not have clear, transparent communication. All of these issues led to the failure of Ariane 501. Eagerness to achieve success can lead to short-sightedness and make organizations overlook essential safety precautions and thorough testing, which can result in disasters like this one. Hence, it is important to adopt a strategy of comprehensive testing, maintenance and coordination between the teams involved in a project, and to ensure that the different system components are compatible. To achieve these objectives, it is also very helpful to consult impartial third-party experts who can review the system and check it further for discrepancies.

2.1 - Patriot MDS case

2.1.1 Case overview

In 1991, the war in the Middle East commonly known as the Gulf War broke out. A coalition led by the United States and Britain fought against Iraq and the army of Saddam Hussein. Operation Desert Storm was the name given to the military operations of the Gulf War. On February 25, 1991, a Patriot missile defence system failed to track and intercept an incoming Scud missile during Operation Desert Storm in Dhahran, Saudi Arabia. The Scud struck an Army barracks and killed 28 American soldiers; the incident is considered to be one of the worst software failures in history.

2.2 - Factual Issues

When the Patriot missile defence system was first developed, it was meant to operate in Europe as a defence against Soviet cruise missiles and medium- to high-altitude aircraft, which were able to travel at up to twice the speed of sound. The Patriot system was designed to be highly mobile so as to avoid detection, and was therefore expected to operate for only a few hours at one location before moving on to the next. In Operation Desert Storm, the Patriot missile system was deployed for the very first time against Scud missiles, which can reach speeds of up to five times the speed of sound. Scud missiles were new to the Army and data on them were scarce, so the Army had to rely heavily on external sources, especially intelligence agencies. Without proper data on the Scud missile, the Patriot system would be ineffective. It was therefore very important for the Army to understand how Scud missiles could be tracked and intercepted effectively.

According to reports, the Patriot battery at Dhahran failed to track and intercept an incoming Scud missile because of a software problem in the system's weapons control computer. This resulted in an inaccuracy in the time calculation that worsened the longer the system operated. Based on the findings, the battery had been operating continuously for 100 hours. By then the inaccuracy was serious enough to cause the system to look in the wrong place for the incoming Scud. The Patriot system had never before been used to defend against Scud missiles, nor had it been expected to operate continuously for such long periods. Reports indicated that Army officials received data from Israel two weeks before the incident. The data indicated some loss in accuracy after the system had been running for 8 hours. Modifications were made to the software to correct this, but the updated software did not reach Dhahran until February 26, 1991, the day after the Scud missile hit the Army barracks.
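How a small arithmetic inaccuracy can grow with running time can be sketched as follows. The report above does not give the register details, so the sketch rests on assumptions taken from the commonly cited account of the bug (for example, the note by Arnold listed in the references): the clock counts tenths of a second, and the value 1/10 is stored chopped to 23 binary digits after the point. The function names are hypothetical.

```python
from fractions import Fraction

# Assumptions (not stated in this report): the clock counts tenths of a
# second, and 1/10 is stored chopped to 23 binary digits after the point.
FRACTION_BITS = 23
TICKS_PER_SECOND = 10


def stored_tenth(fraction_bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to the given number of binary digits."""
    scale = 2 ** fraction_bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_error_seconds(hours: float, fraction_bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after running continuously for `hours`."""
    ticks = int(hours * 3600 * TICKS_PER_SECOND)
    error_per_tick = Fraction(1, 10) - stored_tenth(fraction_bits)
    return float(error_per_tick * ticks)


def tracking_error_metres(hours: float, target_speed_m_per_s: float) -> float:
    """Distance a target travels during the accumulated clock error."""
    return clock_error_seconds(hours) * target_speed_m_per_s


for hours in (8, 20, 100):
    print(hours, "h ->", round(clock_error_seconds(hours), 4), "s")
```

Under these assumptions the error after 100 hours comes to roughly a third of a second. At a Scud speed of about 1,676 m/s, the figure usually quoted in such analyses, that corresponds to more than half a kilometre.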

2.3 - Conceptual Issues

The weapons control computer is the most important component of the Patriot missile system. However, the Patriot system used in Operation Desert Storm was designed back in the 1970s, which places a significant limitation on the precision of tracking and intercepting, the main purpose of the system. Radar works by sending out electronic pulses into the surrounding air. When these pulses hit a target they bounce back to the radar, giving information such as the target's distance and location. The Patriot radar works on the same principle. An additional feature of the system is that the operator can select the specific types of target it should look for; for the Patriot batteries in this case, Scud missiles were selected as the target of choice. The system's range-gate algorithm is the key factor in determining the missile's position accurately. When a flying object has been detected by the Patriot's radar, the range gate defines the area in which the object is expected to be flying and filters out anything outside that area. Only when the object is found inside the range gate can it be confirmed as a Scud missile. The algorithm expects the missile to be at the centre of the range gate, and the same calculation determines where the system looks for it next.
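The idea of a range gate can be sketched in one dimension. This is a deliberately simplified illustration and not the Patriot algorithm: the function names, the constant closing speed and all numbers are hypothetical. The gate is centred on the range predicted from the last known position, and only detections inside the gate are kept.

```python
from typing import List


def predicted_range_m(last_range_m: float, closing_speed_m_per_s: float,
                      elapsed_s: float) -> float:
    """Range at which the target is expected after `elapsed_s` seconds."""
    return last_range_m - closing_speed_m_per_s * elapsed_s


def in_range_gate(measured_range_m: float, gate_centre_m: float,
                  gate_width_m: float) -> bool:
    """True if a detection falls inside the gate centred on the prediction."""
    return abs(measured_range_m - gate_centre_m) <= gate_width_m / 2.0


def filter_detections(detections_m: List[float], gate_centre_m: float,
                      gate_width_m: float) -> List[float]:
    """Keep only the detections that lie inside the range gate."""
    return [r for r in detections_m
            if in_range_gate(r, gate_centre_m, gate_width_m)]


# Hypothetical numbers: target last seen at 10 km, closing at 1500 m/s.
centre = predicted_range_m(10_000.0, 1_500.0, 1.0)
print(filter_detections([8_450.0, 8_600.0, 9_200.0], centre, 400.0))
```

Because the gate centre depends on the elapsed time, an error in the measured time moves the gate away from the true target. In this sketch, using 0.5 s in place of 1.0 s shifts the centre far enough that a target at 8,500 m falls outside a 400 m gate.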


Data received on 11 February 1991 showed a 20% shift in the radar range gate after the system had been running for 8 consecutive hours. This shift meant that the target was no longer in the centre of the range gate, which is a hazard for the system: with a shifted range gate the radar could lose track of the missile and fail to intercept it. As stated in the incident report, Patriot Project Office officials said that the Patriot system would not track a Scud when there is a range gate shift of 50% or more. A simple calculation shows that a 50% shift corresponds to about 20 hours of continuous operation. In other words, running the Patriot system for more than 20 hours is dangerous because the inaccuracy becomes too great for the radar to track a Scud missile. On 25 February 1991, the Patriot system had been running for 100 consecutive hours, and the consequences were fatal.
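The simple calculation referred to above appears to be a linear extrapolation from the 8-hour observation. A minimal sketch, assuming the shift grows in direct proportion to running time (the function names are hypothetical):

```python
def shift_rate_percent_per_hour(observed_shift_percent: float,
                                observed_hours: float) -> float:
    """Rate of range gate shift, assuming it grows linearly with run time."""
    return observed_shift_percent / observed_hours


def hours_until_shift(threshold_percent: float, rate_percent_per_hour: float) -> float:
    """Running time after which the shift reaches the given threshold."""
    return threshold_percent / rate_percent_per_hour


def shift_after(hours: float, rate_percent_per_hour: float) -> float:
    """Shift accumulated after `hours` of continuous operation."""
    return rate_percent_per_hour * hours


rate = shift_rate_percent_per_hour(20.0, 8.0)
print(rate)                           # 2.5 (% per hour)
print(hours_until_shift(50.0, rate))  # 20.0 (hours)
print(shift_after(100.0, rate))       # 250.0 (%)
```

On this assumption the shift after 100 hours of continuous operation would be five times the 50% limit at which tracking is lost.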

2.4 - Ethical Issues

The assumption made by Army officials that Patriot users were not running their systems for more than 8 hours at a time was one of the key factors behind the disaster. When the Israeli data were analysed and the loss in targeting accuracy was detected, the officials made a software change to correct the inaccurate time calculation. The change allowed the system to run for longer and was part of the modified software version released on February 16, 1991. On February 21, 1991, Patriot users received a message from the Patriot Project Office. It stated that running the system for very long periods would result in a shift in the range gate, which would cause the target to be offset, and that a software update was being sent to correct the system's targeting. However, the message did not specify what constituted a very long run time. This was because the Army officials assumed that users would not run the batteries continuously for long enough for the error to emerge. As a result, they did not provide sufficient information, as they thought it was not required.


Unfortunately, after the Israeli data were received, Army officials did not take immediate action to determine how long a Patriot battery could operate before the inaccurate time calculation made the system ineffective. This caused the modifications to be sent out later than they should have been. Ethical failings lie at the root of this disaster, as they led to many of the other problems. Software systems can be very complicated and therefore need a very thorough review and testing process. Although it is difficult to test all combinations of inputs or variable values, the engineers failed to discover all the bugs, and they did not ensure that the rectification was carried out immediately. Complacency in assuming that the system would not run for more than 8 hours was another ethical issue. This suggests that software engineering principles were not properly adhered to.

2.5 - Analysis of the Ethical Issues

Several ethical issues can be discussed from this incident. Although the main cause was a numerical error, many events led up to the disaster, and ethical issues contributed greatly to them. The result was a catastrophic failure involving loss of human life. Inconsistency was one of the primary factors in this mishap. When the error was discovered at Raytheon, the amendment was made only in some parts of the code. This made the code inconsistent, as different parts no longer used the same function; the updated parts used a faster and more precise function. The program had been developed fifteen years earlier in assembly language. With updates and new code added over time, the system became hard to operate and maintain, and the defect was therefore not identified before the incident. Complacency was another cause of the disaster. The Army officials did not investigate the number of hours for which the systems would actually be running. Because they assumed that the systems would not run continuously for a very long time, the message that users received did not specify when the system would start to incur errors.


Another factor that contributed to the software engineering ethical issues is maintainability. It is very important to maintain an old system properly. In 1986, when the Patriot Missile System was modified to enhance missile tracking, the developers chose to recode the software in a high-level language, which would improve the system's performance and maintainability. The fifteen-year-old assembly code, which had been constantly patched, was too complicated for the programmers to understand, and this hampered their work on the system. The checks on a safety-critical system should be comprehensive. The defect should have been detected through rigorous debugging and unit testing, even though these can be intricate and expensive. The problem could also have been identified if retesting had been done after every update or patch. The Army officials did not react fast enough to rectify the software problem; they should have acted more promptly and effectively. The delay in the response led to the disaster, as the software fix was not received in time. Ethically responsible software engineers would have taken the matter seriously and acted to prevent this mishap. The late arrival of the bug fix contributed to the loss of life in Dhahran. Twenty-eight soldiers lost their lives while serving their country. Had better care been taken to conform to ethical standards, their deaths might have been avoided.

2.6 - Recommendations

Ethical responsibility should be instilled in all individuals dealing with the system software. They should not falsify test results, cover up known or suspected problems, leave potentially serious bugs in a product or violate safety standards. When an older software system is adapted for a new user, the new user should understand the whole system before operating it. There must be sufficient analysis of potential failure modes and their consequences, and appropriate safety and robustness features must be designed and implemented. Reports and warnings of possible problems should be acted on aggressively. Engineers and management should not underestimate the complexity of the software. Poor software engineering techniques should not be used, particularly where little or no documentation is available on the development of the software. Management should communicate problems to other users more quickly. Most importantly, the aim should be to protect against errors rather than merely to discover them.

Software engineers shall commit themselves to making the analysis, specification, design, development, testing and maintenance of software a beneficial and respected profession. In accordance with their commitment to the health, safety and welfare of the public, software engineers shall follow these Eight Principles:

1. PUBLIC - Software engineers shall act consistently with the public interest.
2. CLIENT AND EMPLOYER - Software engineers shall act in a manner that is in the best interests of their client and employer consistent with the public interest.
3. PRODUCT - Software engineers shall ensure that their products and related modifications meet the highest professional standards possible.
4. JUDGMENT - Software engineers shall maintain integrity and independence in their professional judgment.
5. MANAGEMENT - Software engineering managers and leaders shall subscribe to and promote an ethical approach to the management of software development and maintenance.
6. PROFESSION - Software engineers shall advance the integrity and reputation of the profession consistent with the public interest.
7. COLLEAGUES - Software engineers shall be fair to and supportive of their colleagues.
8. SELF - Software engineers shall participate in lifelong learning regarding the practice of their profession and shall promote an ethical approach to the practice of the profession.

The role and responsibility of a software engineer is to act professionally and ethically by understanding and appreciating risk factors, and to ensure that proper attention is paid to these principles.


2.7 - Current Practices

The current practice for the Patriot system is to restart it after every 2 to 4 hours of operation. Turning the system off and on again refreshes it and reinitializes the computer clock to zero. Rebooting takes less than a minute. Another immediate action taken by the Army was to provide advanced training for Patriot maintenance operators. Raytheon has developed several training courses to take highly experienced operators to a higher level of proficiency, and every software update or upgrade is taught to the maintenance operators. In time, there should therefore be many highly trained and skilled soldiers. Lastly, though still at the prototype stage, the conventional radar has been replaced by an x-ray radar system. Unlike conventional radar, it does not use electrical pulses or radar waves to detect a missile but instead uses x-rays, which are said to be superior; hence the name x-ray radar system, Therac-12.

2.8 - Conclusion

In conclusion, we realized how much people today rely on software and computers to protect them and even to save lives. The responsibility placed on such software and computers is great, and so great care must be taken with them. It is important to carry out regular maintenance and checks to confirm that everything is running correctly and that nothing in the software could cause a malfunction. Because software is very prone to bugs and viruses, it is also important to debug the software and to apply modifications or updates to the system. Apple and Microsoft, for example, conduct regular updates of their software and, when bugs or viruses are found, release updates or upgrades for download.


References

(n.d.). Retrieved from http://www.engineering.uiowa.edu/~ece_036/Lecture/Ethics.pdf
Arnold, D. N. (1996, September 24). Two disasters caused by computer arithmetic errors.
Aviation Week and Space Technology. (1996, September 12). Ariane 5 Report Details Software Design Errors.
LIONS, P. J. (1996, July 19). Flight 501 Failure. Retrieved March 1, 2011, from http://www.di.unito.it/~damiani/ariane5rep.html
Lum, A. (n.d.). Patriot Missile Software Problem. Retrieved from http://sydney.edu.au/engineering/it/~alum/patriot_bug.html
Math News. (2003, October 17). New anti-missile system declared worst engineering disaster EVER. Retrieved from http://www.mathnews.uwaterloo.ca/Issues/mn9303/arienne.php
Raytheon Company. (n.d.). Raytheon. Retrieved from "Gulf War": http://www.pbs.org/wgbh/pages/frontline/gulf/weapons/raytheontext.html
The Importance of Ethics in Mission and Safety Critical Software Engineering. (n.d.). Retrieved from http://cseserv.engr.scu.edu/StudentWebPages/uletran/uletran_FinalPaper.htm
