Contents

Overview
  Goal
  What Is a Disaster?
Why Plan?
  Benefits of Proper Planning
Planning
  Creating a Plan
  What Are the Business Requirements for Disaster Recovery?
  When Should a Disaster Recovery Procedure Begin?
  Expected Downtime/Recovery Time
  Recovery Group and Staffing
  Types of Disaster Recovery
  Disaster Scenarios
  Three Common Disaster Scenarios
  Recovery Script
  Creating a Recovery Script
  Recovery Process
  Crash Kit
  Business Continuation During Recovery
  Offsite Disaster Recovery Sites
  Integration with Your Company's General Disaster Planning
  When the R/3 System Returns
Test your Disaster Recovery Procedure
Other Considerations
  Other Up- or Downstream Applications
  Backup Sites
Minimizing the Chances for a Disaster
  Minimize Human Error
  Minimize Single Points of Failure
  Cascade Failures
Chapter 2: Disaster Recovery
R/3 System Administration Made Easy, Release 4.0B

Overview

The purpose of this chapter is to help you understand what we feel is the most critical job of a system administrator: disaster recovery.

We have chosen to include this chapter at the beginning of our guidebook for two reasons:
- To emphasize the importance of the subject.
  Disaster recovery needs to be planned as soon as possible, because it will take time to develop, test, and refine.
- To emphasize the importance of being well prepared in the event of a potential disaster.
  Murphy's Law says: "Disaster will strike when you are not prepared for it." The sooner you begin planning, the more prepared you will be when it happens.

This chapter is not a disaster recovery "how to." It is only a start to get you thinking and working on disaster recovery.

Goal

The goal of disaster recovery is to restore the system so that the company can continue doing business.

What Is a Disaster?

A disaster is anything that results in the corruption or loss of the R/3 System. Examples include:
- Database corruption, for example, when test data is accidentally loaded into the production system. This type of corruption happens more often than people realize.
- A serious hardware failure.
- A complete loss of the R/3 System and infrastructure, for example, the destruction of the building due to a natural disaster.

The ultimate responsibility of a system administrator is to successfully restore R/3 after a disaster. The ultimate consequence of not restoring the system is that your company goes out of business. The administrator's goal is to prevent the system from ever reaching the situation where the ultimate responsibility is called upon.

Disaster recovery planning is a major project in itself. Depending on your situation and the size and complexity of your company, disaster recovery planning could take more than a year to prepare, test, and refine. The plan could fill many volumes.
This chapter helps you start thinking about and planning for disaster recovery.

Why Plan?

A system administrator should expect and plan for the worst, and then hope for the best. During a disaster recovery, nothing should be done for the first time. Unpleasant surprises could be fatal to the recovery process.

Here are some of the questions that a disaster recovery plan must answer:
- Will business operations stop if R/3 fails?
- How much lost revenue and cost will be incurred for each hour that the system is down?
- Which critical business functions cannot be completed?
- How will customers be supported?
- How long can the system be down before the company goes out of business?
- Who, if anyone, is coordinating and managing the disaster recovery?
- What will the users do while R/3 is down?
- How long will it take before the R/3 System is back up?

Benefits of Proper Planning

You will be under less stress, because you know that the system can be recovered and how long this recovery will take.

If the recovery downtime is unacceptable, management should invest in:
- Equipment, facilities, and personnel
- High availability (HA) options

HA options can be expensive. There are different degrees of HA, so customers need to determine which option is right for them. HA is an advanced topic beyond the scope of this guidebook. If you are interested in this topic, contact an HA vendor.

Planning

Creating a Plan

Creating a disaster recovery plan is a major project because:
- It can take over a year of considerable effort to develop, test, and document.
- The documentation may be extensive (literally thousands of pages long).

If you do not know how to plan for a disaster recovery, get the assistance of an expert. A bad plan (one that will fail) is worse than no plan, because it provides a false sense of security.

What Are the Business Requirements for Disaster Recovery?

Who will provide the requirements?
- Senior management needs to provide global or strategic requirements and guidelines.
- The business units' needs drive the specific detailed requirements. These units should understand that as the required recovery time decreases, the cost of disaster recovery increases. The units should budget for it, or if the funds come from an administrative or IT budget, the units should support it.

What are the requirements?

Each requirement should answer the following questions:
- Who is the requestor?
- Are other departments or customers affected by this requirement?
- What is the requirement?
- Why is the requirement necessary? When R/3 is offline, what does (or does not) happen? What is the cost (or lost revenue) of an hour or a day of R/3 downtime?

The justification should be a concrete objective value (such as $20,000 an hour). Define the cost (per hour, per day, and so on) of having the R/3 System down.

Some examples:
- Example 1
  What: No more than one hour of transaction data may be lost.
  Why: The cost is 1,000 lost transactions per hour that were entered into R/3 and cannot be recreated from memory. This inability to recreate lost transactions may result in lost sales and upset customers. This situation can be critical if the lost orders are those that the customer needs quickly.
- Example 2
  What: Cannot be offline for more than three hours.
  Why: The cost (an average of $25,000 per hour) is the inability to book sales.
- Example 3
  What: In the event of a disaster, such as the loss of the building containing the R/3 data center, the company can only tolerate a two-day downtime.
  Why: At that point, permanent customer loss begins.
  Other: There must be an alternate method of continuing business.

When Should a Disaster Recovery Procedure Begin?

Ask yourself the following questions:
- What criteria constitute a disaster? Have these criteria been met?
- Who needs to be consulted? Someone has to make the final decision. This person must be aware of the effect of the disaster on the company's business, and the importance and critical nature of the recovery.

Expected Downtime/Recovery Time

Expected Downtime

Expected downtime is only part of the business cost of disaster recovery. For defined scenarios, the expected downtime is the minimum time expected before R/3 can be productive again. For the company, downtime may mean that no orders can be processed and no products shipped.

Management must approve this cost, so it is important that they understand and accept downtimes as potential business costs. It is important to find out if there are alternate processes that can be used while the R/3 System is being recovered, so business may continue.

The following costs are involved with downtimes:
- The longer the system is down, the longer the catch-up period when it is brought back up. The transactions from the alternate processes that were in place during the disaster have to be applied to the system to make it current. This situation is more critical in a high-volume environment.
- The cost of a downed system is higher during the business day, when business could stop, than at the end of the business day, when everyone has gone home.
- Customers may be lost when they cannot be serviced and supported.

The duration of acceptable downtime depends on the company and the nature of its business.

Recovery Time

Unless you test your recovery procedure, the recovery time is only an estimate, or worse, a guess. Different disaster scenarios have different recovery times based on what needs to be done to recover the system and become operational again.

The time to recover must be matched to the business requirements. If the recovery time is longer than the business requirements allow, this mismatch needs to be communicated to the appropriate managers or executives for resolution.
This resolution may include:
- Investing in equipment, processes, and facilities to reduce the recovery time
- Changing the business requirements and accepting the consequences

An extreme (but possible) example: A company cannot afford the cost and lost revenue for the one month it would take for one person to recover the system. During that time, the competition would take away customers, payment would be due to vendors, and bills could not be collected. In such a situation, senior management needs to allocate resources to reduce the recovery time to an acceptable level.

Recovery Group and Staffing

To adequately staff a recovery group, consider the following:
- One person should be responsible for managing the entire recovery. All recovery activities should be coordinated with this person.
- One person should handle incoming user calls and keep top management updated with the status of the recovery. Having one person handle incoming calls allows the person (or group) doing the technical recovery to work without being interrupted. To reduce interruption of the recovery staff, we recommend that you post a status board listing the status of key points in the recovery plan, and an estimate of when the system will be recovered and available to use.
- One person should handle or coordinate the technical recovery. As things progress, the original plan may have to be changed. It is key that one person knows what is going on, so that this person can manage any needed changes and coordinate the technical recovery.
- One person should coordinate and plan the post-recovery testing and certification with users.
- The staffing plan should allow for one person (or more) to be unavailable. The remaining group should be able to perform a successful recovery.
Keep in mind the following:
- If the disaster is a major geographical/regional event (for example, an earthquake), your local staff will probably be more concerned with their families, not the company.
- Depending on the disaster, key personnel could be injured or killed.

You should expect and plan for the above situations. Plan for staff from other geographic sites to be flown in to participate as members of the disaster recovery team.

Types of Disaster Recovery

Disaster recovery scenarios fall into two main groups, onsite and offsite recovery. It is important to differentiate between the two because each dictates separate disaster scenarios.

Onsite

Onsite recovery is disaster recovery done at your site. The best case scenario is a recovery done on the original hardware. The worst case scenario is a recovery done on a backup system. Infrastructure usually remains intact.

Offsite

Offsite recovery is disaster recovery done at a disaster recovery site. In this scenario, all hardware and infrastructure are lost due to destruction of the facility (such as a fire) or a major natural disaster (such as a flood or earthquake). You will need to configure the new servers from scratch.

A major consideration with an offsite recovery is the recovery back to the customer's facility, once the facility has been rebuilt and tested. This move is similar to a disaster recovery, only in reverse. The backup of the database and related files from the disaster site is restored at the customer's site. The timing here is just as critical as it is in a disaster. While the system is being restored, it is down.

Disaster Scenarios

The number of possible scenarios is effectively unlimited, and planning for all of them would take unlimited time. To make this task manageable, you should plan for three to no more than five scenarios to be used as the basis for the actual recovery.
In the event of a disaster, you would adapt the closest scenario(s) to the actual disaster. Therefore, a wide range of scenarios is needed.

To create your scenarios:
1. Use the Three Common Disaster Scenarios section below as a base.
2. Prepare three to five scenarios applicable to your company that cover a wide range of possibilities.
3. Create a high-level plan (of major tasks) for each scenario.
4. Test the planned scenarios by creating different test disaster scenarios and determining if (and how) your scenario plans would be adapted to an actual disaster.
5. If a test scenario cannot be handled, then it should be planned for by adding to or changing one of your scenarios.

Once this process is complete, your detailed planning should be based on these high-level plans.

Three Common Disaster Scenarios

The following three examples are ordered from best case to worst case.

The sample downtimes are only examples showing the difference in downtimes between scenarios. They are valid only for the one specific environment that they were derived from. Your downtimes will be different. You must replace the sample downtimes with the downtimes applicable to your own environment.

A Corrupt Database

- During this disaster event, the database is corrupted.
- Examples:
  - Accidentally loading test data into the production system.
  - A bad transport into production, resulting in the failure of the production system.
- Such a disaster requires the recovery of the R/3 database and related operating system files.
- The sample downtime is eight hours.
A Hardware Failure

- During this disaster event, there is a major hardware failure.
- Examples:
  - Failure of a system processor
  - Failure of a drive controller
  - Failure of multiple drives in a drive array, so that the drive array fails
- Such a disaster scenario requires:
  - Replacement of failed hardware
  - Rebuilding the server (operating system and all programs)
  - Recovering the R/3 database and related files
- The sample downtime is seven days, and consists of:
  - Five days to procure replacement hardware
  - Two days to rebuild the NT server (one person); 16 hours of actual work time

A Complete Loss or Destruction of the Server Facility

- During this disaster event, the following items are lost:
  - Servers
  - All supporting infrastructure
  - All documentation and materials in the building (and possibly the building)
- Examples:
  - Fire, earthquake, flood, hurricane, or tornado
  - The World Trade Center bombing
- Such a disaster requires:
  - Replacement facilities
  - Replacement of infrastructure
  - Replacement of lost hardware
  - Rebuilding the server and R/3 environment (hardware, operating system, database, etc.)
  - Recovering the R/3 database and related files
- The sample downtime is eight days, and consists of:
  - At least five days to procure hardware. In a regional disaster, this could take longer if your suppliers were also affected by the disaster. Use national vendors with several regional distribution centers, or have an out-of-area alternate supplier.
  - Two days to rebuild the NT server (one person); 16 hours of actual work time
  - As the hardware is procured and the server is being rebuilt, an alternate facility is obtained and an emergency (minimal) network is constructed
  - One day to integrate into the emergency network
- This scenario also requires a recovery back to a rebuilt facility, at a future time.
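
The sample downtimes above follow from which tasks must wait for others and which can run in parallel. The following Python sketch shows one way to add them up. It is only an illustration: the task names are hypothetical, the five-day, two-day, and one-day durations are the sample figures from the scenarios, and the seven days allowed for obtaining the alternate facility and building the emergency network is an assumption, since the scenario only says that this work happens while the hardware is procured and the server is rebuilt.

```python
# Durations are in days. Each entry is (duration, list of prerequisite tasks).
# Task names are hypothetical. The 7-day facility/network figure is an
# assumption for illustration; the other durations are the sample figures.
HARDWARE_FAILURE = {
    "procure_hardware": (5, []),
    "rebuild_server": (2, ["procure_hardware"]),
}

COMPLETE_LOSS = {
    "procure_hardware": (5, []),
    "rebuild_server": (2, ["procure_hardware"]),
    "facility_and_network": (7, []),  # runs in parallel with the two tasks above
    "integrate_into_network": (1, ["rebuild_server", "facility_and_network"]),
}


def earliest_finish(tasks):
    """Return the earliest finish day of every task, assuming each task
    starts as soon as all of its prerequisites have finished."""
    finish = {}

    def resolve(name, in_progress=()):
        if name in finish:
            return finish[name]
        if name in in_progress:
            raise ValueError(f"circular dependency at {name}")
        duration, prerequisites = tasks[name]
        start = max(
            (resolve(p, in_progress + (name,)) for p in prerequisites),
            default=0,
        )
        finish[name] = start + duration
        return finish[name]

    for task_name in tasks:
        resolve(task_name)
    return finish


def total_downtime(tasks):
    """The scenario is over when the last task finishes."""
    return max(earliest_finish(tasks).values())


if __name__ == "__main__":
    print("Hardware failure:", total_downtime(HARDWARE_FAILURE), "days")
    print("Complete loss:", total_downtime(COMPLETE_LOSS), "days")
```

With these inputs the totals match the sample downtimes of seven and eight days. Replace the durations with figures measured in your own environment; if the facility and network work took longer than the hardware procurement and rebuild, it would become the step that determines the total.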
Recovery Script

What

A recovery script is a document that provides step-by-step instructions about:
- The process required to recover R/3
- Who will complete each step
- The expected time for steps that take a long time to run
- Dependencies between steps

Why

A script is necessary because it helps you:
- Develop and use a proven series of steps to restore R/3
- Prevent missing steps
  Missing a critical step may require restarting the recovery process from the beginning, which will delay the recovery.

A recovery script also helps the backup person do the recovery, if the primary recovery person is not available.

Creating a Recovery Script

Creating a recovery script requires:
- A checklist for each step

And, where necessary to clarify:
- A document with screenshots to clarify the instructions
- Flowcharts, if the flow of steps or activities is critical or confusing

Recovery Process

To reduce recovery time, define a process by:
- Completing as many tasks as possible in parallel
- Adding timetables for each step

Major Steps

1. During a potential disaster, anticipate a recovery by:
   - Collecting facts
   - Recalling the latest offsite tapes
   - Recalling the crash kit
   - Calling all required personnel (such as the internal SAP team, affected key users, infrastructure support, IT, facilities, on-call consultants, and so on)
   - Having functional organizations (sales, finance, and shipping) prepare alternate manual procedures for key business transactions and processes
2. Minimize the effect of the problem by:
   - Stopping all additional transactions into the system
     Waiting too long could make the problem even worse.
   - Collecting transaction records that have to be manually reentered
3. Begin the planning process by:
   - Analyzing the problem
   - Fitting the disaster to your predefined scenario plans
   - Modifying the plans as needed for the disaster
4. Define when to initiate a disaster recovery procedure:
   - What are the criteria to declare a disaster, and have they been met?
   - Who will make the final decision to declare a disaster?
5. Declare the disaster.
6. Perform the recovery of the system.
7. Test and sign off on the recovered system.
   Key users, who will use a criteria checklist to determine that the system has been satisfactorily recovered, should perform the testing.
8. Catch up with financial transactions that may have been handled by alternate processes during the disaster.
   Once completed, this step should require an additional sign-off.
9. Notify the users that the system is ready for normal operations.
10. Conduct a postmortem debriefing.
    Use the results from this postmortem to improve your disaster recovery planning.

Crash Kit

What

A crash kit contains everything needed to:
- Rebuild the R/3 servers
- Reinstall R/3
- Recover the R/3 database and related files

Why

In the event of a disaster, everything that is needed to recover the R/3 environment is contained in one (or a few) containers. If you have to evacuate the site, you do not have time to run around gathering all the items at the last minute, hoping that you got everything you need. In a major disaster you may not even have that opportunity.

When

When a change is made to any component (hardware or software) on the server, replace the outdated items in the crash kit with updated items that have been tested.

Where to Put the Crash Kit

The crash kit should be physically separated from the servers. If it is located in the server room, and the server room is destroyed, you have lost your crash kit. Possible locations include:
- Commercial offsite data storage
- Other company sites
- Another secure section of the building

How

The following is a checklist of the items to put into the crash kit. You may need to add or delete items for your specific environment.
Documentation

- An inventory (list) of what is in the crash kit, signed and dated by the person checking the crash kit.
  The crash kit should be sealed by the person checking the kit and taking the inventory of its contents. If the seal is broken, someone may have removed or changed items from the crash kit, and possibly made it useless in a recovery.
- Disaster recovery script
- Installation instructions for the:
  - Operating system
  - Database
  - R/3 System
  - Others
- Special installation instructions for:
  - Drivers that have to be manually installed
  - Programs that must be installed in a specific manner
- Copies of the SAP license for all instances
- Copies of service agreements (with phone numbers) for all servers
  Find out if maintenance agreements are still valid. Have any agreements expired?
- Instructions to recall tapes from offsite data storage
- The list of personnel who are authorized to recall tapes from offsite data storage
  The list of personnel must correspond to the list maintained by the data storage company.
- A parts list
  If the server is destroyed, this list should be in sufficient detail to purchase or lease replacement hardware. Over time, original parts may no longer be available. An alternate parts list will have to be prepared when this happens. At this point, you might consider upgrading the equipment.
- File system layout
- Hardware layout. You need to know which:
  - Cards go in which slots
  - Cables go where (connector by connector)
  Labeling cables and connectors greatly reduces confusion.
- Phone numbers for:
  - Key users
  - Information services personnel
  - Facilities personnel
  - Other infrastructure personnel
  - Consultants
  - SAP Hotline
  - Offsite data storage
  - Security department or personnel
  - Service agreement contacts
  - Hardware vendors

Software

- Operating system:
  - Installation kit
  - Drivers for hardware, such as a Network Interface Card (NIC) or a SCSI controller, which are not included in the installation kit
  - Service packs, updates, and patches
- Database:
  - Installation kit
  - Service packs, updates, and patches
  - Recovery scripts, to automate the database recovery
- For R/3:
  - Installation kit
  - Currently installed kernel
  - System profile files
  - tpparam file
  - saprouttab file
- Other R/3 integrated programs (for example, a tax package)
- Other software for the R/3 installation:
  - Utilities
  - Backup
  - UPS control program
  - Hardware monitor
  - FTP client
  - Remote control program
  - System monitor

Business Continuation During Recovery

What will your company do while the R/3 System is being recovered?

Business continuation is an alternate process to continue doing business while recovering from a disaster.
It includes:
- Cash collection
- Order processing
- Product shipping
- Bill paying
- Payroll processing
- Alternate location to continue business

Why

Without an alternate process, your company could stop doing business. For example:
- Orders cannot be entered
- Product cannot be shipped
- Money cannot be collected

How

There are many alternate processes, including:
- Manual paper-based
- Stand-alone PC-based products

Offsite Disaster Recovery Sites

- Other company sites
- Commercial disaster recovery sites
- Space shared with or rented from other companies

Integration with Your Company's General Disaster Planning

Because there are many interdependencies, the R/3 disaster recovery process must be integrated into your company's general disaster planning. This planning includes telephone, network, product deliveries, mail, and so on.

When the R/3 System Returns

How will the transactions that were handled with the alternate process be entered into R/3 when R/3 is back in production?

Test your Disaster Recovery Procedure

Do you know if you can actually recover the system? Unless you test your recovery process, you do not know. A test is a simulated disaster recovery done to verify that you can recover the system and to exercise every task outlined in the disaster recovery plan.

- Test to find out if:
  - Your disaster recovery procedure works
  - Something changed and was not documented or updated
  - There are steps that need clarification for others
    What is clear to the person documenting the steps may be unclear to another person reading the documentation.
  - Older hardware is no longer available
    If so, alternate planning needs to be done and tested. You may have to upgrade your hardware to be compatible with currently available equipment.

Since many factors affect recovery time, actual recovery times can only be determined by testing.
Once you have actual times (not guesses or estimates), your disaster planning becomes more credible. If the procedure is practiced often, everyone will know what to do when a disaster occurs. This way, the chaos of a disaster will be reduced.

How

- Execute your disaster recovery plan on a backup system or at an offsite location.
- Generate a random disaster scenario.
- Execute your disaster plan to see if it handles the scenario.

When

- A full disaster recovery should be practiced at least once a year.

Where

- The disaster recovery test should be done at the same site where you expect to recover.
  If you have multiple recovery sites, perform a test recovery at each site. The equipment, facilities, and configuration may be different from site to site. Document all specific items that need to be completed for each site. If you do not test at a particular site, you will not know if recovery is possible at that site. You do not want to discover that you cannot recover at a site after a disaster occurs.
- Possible test sites include:
  - A backup onsite server
  - Another company site
  - Another company where you have a mutual support agreement
  - A company that provides a disaster recovery site and services

Who Should Participate

- Primary and backup personnel who will do the job during a real disaster recovery
  A provision should be made that some of the primary personnel will be unavailable during a disaster recovery. A test procedure might be to pull a name from a hat, and declare that person unavailable to participate. This duplicates a real situation in which a primary person is seriously injured or killed.
- Personnel at other sites
  Integrate these people into the test, since they may be needed to perform the recovery during a real disaster. These people will fill in for unavailable personnel.
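
The random disaster scenario and the name pulled from a hat can both be drawn by a small script, so that nobody on the team chooses the test conditions. The following Python sketch is only an illustration: the function name, the scenario labels, and the staff names are all hypothetical, and it assumes you keep your own lists of planned scenarios and primary personnel.

```python
import random


def plan_drill(scenarios, primary_staff, seed=None):
    """Pick one scenario and one primary person to declare unavailable.

    Passing a seed makes the draw repeatable, which is useful if the
    drill conditions have to be documented or audited afterwards.
    """
    if not scenarios or len(primary_staff) < 2:
        raise ValueError("need at least one scenario and two primary staff")
    rng = random.Random(seed)
    scenario = rng.choice(scenarios)
    unavailable = rng.choice(primary_staff)
    participants = [person for person in primary_staff if person != unavailable]
    return {
        "scenario": scenario,
        "unavailable": unavailable,
        "participants": participants,
    }


if __name__ == "__main__":
    # Hypothetical scenario labels and staff names for illustration only.
    drill = plan_drill(
        scenarios=["corrupt database", "hardware failure", "loss of facility"],
        primary_staff=["recovery manager", "technical lead", "call coordinator"],
    )
    print("Scenario to test:", drill["scenario"])
    print("Declared unavailable:", drill["unavailable"])
    print("Remaining team:", ", ".join(drill["participants"]))
```

The backup for the person declared unavailable then performs that role during the test, which shows whether the recovery script is clear enough for someone other than its author.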
Other Considerations

Other Up- or Downstream Applications

Other up- or downstream applications also need to be recovered with R/3 for the company to function. Some of these may be very tightly associated with R/3. These applications should be accounted for and protected in the company-wide disaster recovery planning. Be careful of applications that are located only on one person's desktop computer.

Backup Sites

Having a contract with a disaster recovery site does not guarantee that the site will be available. In a regional disaster, such as an earthquake or flood, many other companies will be competing for the same commercial disaster sites. In this situation, you may not have a site to recover to, if others have booked it before you.

The emergency backup site may not have equipment of the same performance level as your production system. Reduced performance and transaction throughput must be considered. Examples:
- A reduced batch schedule of only critical jobs.
- Only essential business tasks will be done while on the recovery system.

Minimizing the Chances for a Disaster

There are many ways to minimize the chances for a disaster. Some of these ideas are quite obvious, but it is these obvious ones that can be forgotten.

Minimize Human Error

Many disasters are caused by human error, such as a mistake or a tired operator.
- Potentially damaging tasks should be scripted, and checkpoints should be included to verify what is being done before doing dangerous tasks such as:
  - Deleting the test database
    Check that the delete command specifies the Test database, not the Production database.
  - Moving a file
    Verify that the target file (to be overwritten) is the old file, not the new file.
  - Formatting a new drive
    Verify that the new drive will be formatted, not an existing drive with data on it.
- Avoid dangerous tasks when you are tired. If a task cannot wait, get a second opinion before doing something potentially dangerous.

Minimize Single Points of Failure

A single point of failure exists when the failure of a single component causes the entire system to fail or crash.

To minimize single points of failure:
- Identify conditions where a single-point failure can occur
- Anticipate what will happen if this component or process fails
- Eliminate as many of these single points of failure as practical.
  "Practical" is defined by the amount of work or cost compared to the risk of failure.

Examples:
- The backup R/3 server is located in the same data center as the production R/3 server.
  If the data center is destroyed, the backup server is also destroyed.
- All the R/3 servers are on a single electrical circuit.
  If the circuit breaker opens, everything on that circuit loses power, and all the servers will crash.

Cascade Failures

A cascade failure occurs when one failure triggers additional failures, which increases the complexity of a problem. The recovery involves the coordinated fixing of many problems.

For example, note the following progression of events:
1. A power failure in the air conditioning system causes an environmental (air conditioning) failure in the server room.
2. Without cooling, the temperature in the server room rises above the equipment's acceptable operating temperature.
3. The overheating causes a hardware failure in the server.
4. The hardware failure causes a database corruption.

In addition, overheating can damage many things, so you may not know what else was damaged, such as:
- Network equipment
- Phone system
- Other servers

The recovery becomes complex because:
- Fixing one problem may uncover other problems or damaged equipment.
- There are dependencies: certain things cannot be tested or fixed until other equipment is operational.
In this case, a system that monitors the air conditioning system and/or the temperature in the server room could alert the appropriate people before the temperature in the server room becomes too hot.
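
As a rough illustration of such monitoring, the following Python sketch classifies server room temperature readings against two thresholds and reports only the changes in status, so that people are alerted while the room is still warming up rather than after the equipment limit has been passed. The function names and the threshold values in the usage example are assumptions for illustration; take real thresholds from your equipment's specified operating range, and feed in readings from whatever sensor your site uses.

```python
def classify(temp_c, warn_at_c, critical_at_c):
    """Map one temperature reading (degrees Celsius) to a status."""
    if temp_c >= critical_at_c:
        return "critical"
    if temp_c >= warn_at_c:
        return "warning"
    return "ok"


def status_changes(readings, warn_at_c, critical_at_c):
    """Yield (index, temperature, status) whenever the status changes.

    Reporting only the changes avoids flooding the on-call staff with a
    new alert for every reading taken while the room stays too warm.
    """
    previous = "ok"
    for index, temp_c in enumerate(readings):
        status = classify(temp_c, warn_at_c, critical_at_c)
        if status != previous:
            yield index, temp_c, status
        previous = status


if __name__ == "__main__":
    # Hypothetical readings and thresholds, for illustration only.
    sample_readings = [22.0, 24.0, 28.0, 29.0, 33.0, 26.0, 21.0]
    for index, temp_c, status in status_changes(sample_readings, 27.0, 32.0):
        print(f"reading {index}: {temp_c} C -> {status}")
```

The warning threshold is the one that matters for preventing a cascade: it should be set low enough that someone can restore cooling or shut down the servers in an orderly way before the critical level is reached.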