
System Administration Made Easy

Chapter 2: Disaster Recovery


Contents
Overview
  Goal
  What Is a Disaster?
Why Plan?
  Benefits of Proper Planning
Planning
  Creating a Plan
  What Are the Business Requirements for Disaster Recovery?
  When Should a Disaster Recovery Procedure Begin?
  Expected Downtime/Recovery Time
  Recovery Group and Staffing
  Types of Disaster Recovery
  Disaster Scenarios
  Three Common Disaster Scenarios
  Recovery Script
  Creating a Recovery Script
  Recovery Process
  Crash Kit
  Business Continuation During Recovery
  Offsite Disaster Recovery Sites
  Integration with your Company's General Disaster Planning
  When the R/3 System Returns
Test your Disaster Recovery Procedure
Other Considerations
  Other Up- or Downstream Applications
  Backup Sites
Minimizing the Chances for a Disaster
  Minimize Human Error
  Minimize Single Points of Failure
  Cascade Failures

Release 4.0B
Overview
The purpose of this chapter is to help you understand what we feel is the most critical job of
a system administrator: disaster recovery.
We have chosen to include this chapter at the beginning of our guidebook for two reasons:
< To emphasize the importance of the subject
Disaster recovery needs to be planned as soon as possible, because it will take time to
develop, test, and refine.
< To emphasize the importance of being well prepared in the event of a potential disaster
Murphy's Law says:
Disaster will strike when you are not prepared for it.
The sooner you begin planning, the more prepared you will be when it happens.
This chapter is not a disaster recovery how-to. It is only a start to get you thinking about and
working on disaster recovery.
Goal
The goal of disaster recovery is to restore the system so that the company can continue
doing business.
What Is a Disaster?
A disaster is anything that results in the corruption or loss of the R/3 System.
Examples include:
< Database corruption.
For example, when test data is accidentally loaded into the production system.
This type of corruption happens more often than people realize.
< A serious hardware failure.
< A complete loss of the R/3 System and infrastructure.
For example, the destruction of the building due to natural disaster.
The ultimate responsibility of a system administrator is to successfully restore R/3 after a
disaster.
The ultimate consequence of not restoring the system is that your company goes out of
business.
The administrator's goal is to prevent the system from ever reaching the situation where the
ultimate responsibility is called upon.
Disaster recovery planning is a major project in itself. Depending on your situation and the
size and complexity of your company, disaster recovery planning could take more than a
year to prepare, test, and refine. The plan could fill many volumes. This chapter helps you
start thinking about and planning for disaster recovery.
Why Plan?
A system administrator should expect and plan for the worst, and then hope for the best.
During a disaster recovery, nothing should be done for the first time. Unpleasant surprises
could be fatal to the recovery process.
The following questions show why you need a disaster recovery plan:
< Will business operations stop if R/3 fails?
< How much lost revenue and cost will be incurred for each hour that the system is down?
< Which critical business functions cannot be completed?
< How will customers be supported?
< How long can the system be down before the company goes out of business?
< Who (if anyone) is coordinating and managing the disaster recovery?
< What will the users do while R/3 is down?
< How long will the system be down?
< How long will it take before the R/3 System is back up?
Benefits of Proper Planning
You will be under less stress, because you know that the system can be recovered
and how long this recovery will take.
If the recovery downtime is unacceptable, management should invest in:
< Equipment, facilities, and personnel
< High availability (HA) options
HA options can be expensive. There are different degrees of HA, so customers need to
determine which option is right for them.
HA is an advanced topic beyond the scope of this guidebook. If you are interested in this
topic, contact an HA vendor.
Planning
Creating a Plan
Creating a disaster recovery plan is a major project because:
< It can take considerable time (over a year) to develop, test, and document.
< The documentation may be extensive (literally thousands of pages long).
If you do not know how to plan for a disaster recovery, get the assistance of an expert. A
bad plan (that will fail) is worse than no plan, because it provides a false sense of security.
What Are the Business Requirements for Disaster Recovery?
Who will provide the requirements?
< Senior management needs to provide global or strategic requirements and guidelines.
< The business units' needs drive the specific detailed requirements.
These units should understand that as the requirement for the recovery time decreases,
the cost for disaster recovery increases. The units should budget for it, or if the funds
come from an administrative or IT budget, the units should support it.
What are the requirements?
Each requirement should answer the following questions:
< Who is the requestor?
< Are other departments or customers affected by this requirement?
< What is the requirement?
< Why is the requirement necessary?
When R/3 is offline, what does (or does not) happen?
What is the cost (or lost revenue) of an hour or a day of R/3 downtime?
The justification should be a concrete objective value (such as $20,000 an hour). Define
the cost (per hour, per day, and so on) of having the R/3 System down.
Some examples:
< Example 1
What: No more than one hour of transaction data may be lost.
Why: At 1,000 transactions per hour, the cost is the loss of transactions that were entered into
R/3 and cannot be recreated from memory.
This inability to recreate lost transactions may result in lost sales and upset
customers. This situation can be critical if the lost orders are those that the customer
quickly needs.
< Example 2
What: Cannot be offline for more than three hours.
Why: The cost (an average of $25,000 per hour) is the inability to book sales.
< Example 3
What: In the event of disaster, such as the loss of the building containing the R/3
data center, the company can only tolerate a two-day downtime.
Why: At that point, permanent customer loss begins.
Other: There must be an alternate method of continuing business.
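To make the cost justification concrete, the arithmetic can be scripted. The following is a minimal sketch, assuming the cost accrues at a flat hourly rate (in practice it may vary with the time of day). The function name downtime_cost is our own; the figures are taken from Example 2.

```python
def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Return the estimated business cost of an outage lasting hours_down hours.

    Assumes a flat hourly cost, which is a simplification.
    """
    if hours_down < 0 or cost_per_hour < 0:
        raise ValueError("hours and hourly cost must be non-negative")
    return hours_down * cost_per_hour


# Example 2: an average of $25,000 per hour, with a three-hour limit.
limit_cost = downtime_cost(3, 25_000)
print(f"Cost at the three-hour limit: ${limit_cost:,.0f}")
```

Replace the hourly figure with the value that your business units provide, and run the calculation for each downtime that you expect in your disaster scenarios.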
When Should a Disaster Recovery Procedure Begin?
Ask yourself the following questions:
< What criteria constitute a disaster?
Have these criteria been met?
< Who needs to be consulted?
Someone has to make the final decision. This person must be aware of the effect of the
disaster on the company's business, and the importance and critical nature of the recovery.
Expected Downtime/Recovery Time
Expected Downtime
Expected downtime is only part of the business cost of disaster recovery. For defined
scenarios, this cost is the expected minimum time before R/3 can be productive again. For
the company, downtime may mean that no orders can be processed and no products
shipped. Management must approve this cost, so it is important that they understand and
accept downtimes as potential business costs.
It is important to find out if there are alternate processes that can be used while the R/3
System is being recovered, so business may continue.
The following costs are involved with downtimes:
< The longer the system is down, the longer the catch-up period when it is brought back
up.
The transactions from the alternate processes that were in place during the disaster have
to be applied to the system to make it current. This situation is more critical in a high-
volume environment.
< The cost of a downed system is higher during the business day, when business could
stop, than at the end of the business day, when everyone has gone home.
< Customers may be lost when they cannot be serviced and supported.
The duration of acceptable downtime depends on the company and the nature of its
business.
Recovery Time
Unless you test your recovery procedure, the recovery time is only an estimate, or worse, a
guess. Different disaster scenarios have different recovery times based on what needs to be
done to recover the system and become operational again.
The time to recover must be matched to the business requirements. If the recovery time is
greater than the business requirements, this mismatch needs to be communicated to the
appropriate managers or executives for resolution.
This resolution may include:
< Investing in equipment, processes, and facilities to reduce the recovery time
< Changing the business requirements and accepting the consequences
An extreme (but possible) example: A company cannot afford the cost and lost revenue for
the one month it would take for one person to recover the system. During that time, the
competition would take away customers, payment would be due to vendors, and bills could
not be collected.
In such a situation, senior management needs to allocate resources to reduce the recovery
time to an acceptable level.
Recovery Group and Staffing
To adequately staff a recovery group, the following staffing considerations should be made:
< One person should be responsible for managing the entire recovery.
All recovery activities should be coordinated with this person.
< One person should handle incoming user calls, and keep top management updated with
the status of the recovery.
Having one person handling incoming calls allows the person (or group) doing the
technical recovery to do so without being interrupted.
To reduce interruption of the recovery staff, we recommend that you post a status board
listing the status of key points in the recovery plan, and an estimate of when the system
will be recovered and available to use.
< One person should handle or coordinate the technical recovery.
As things progress, the original plan may have to be changed. It is key that one person
knows what is going on, so that this person can manage any needed changes and
coordinate the technical recovery.
< One person should coordinate and plan the post-recovery testing and certification with
users.
< The staffing plan should allow for one person (or more) to be unavailable. The
remaining group should be able to perform a successful recovery.
Keep in mind the following:
< If the disaster is a major geographical/regional event (for example, an earthquake),
your local staff will probably be more concerned with their families, not the company.
< Depending on the disaster, key personnel could be injured or killed.
You should expect and plan for the above situations. Plan for staff from other geographic
sites to be flown in to participate as members of the disaster recovery team.
Types of Disaster Recovery
Disaster recovery scenarios can be grouped into two main groups, onsite and offsite
recovery. It is important to differentiate between the two because each dictates separate
disaster scenarios.
Onsite
Onsite recovery is disaster recovery done at your site. The best case scenario is a recovery
done on the original hardware. The worst case scenario is a recovery done on a backup
system. Infrastructure usually remains intact.
Offsite
Offsite recovery is disaster recovery done at a disaster recovery site. In this scenario, all
hardware and infrastructure are lost due to destruction of the facility (such as a fire) or a
major natural disaster (such as a flood or earthquake). You will need to configure the new
servers from scratch.
A major consideration with an offsite recovery is the recovery back to the customer's
facility, once the facility has been rebuilt and tested. This move is similar to a disaster
recovery, only in reverse. The backup of the database and related files from the disaster site
is restored at the customer's site. The timing here is just as critical as it is in a disaster. While
the system is being restored, it is down.
Disaster Scenarios
The number of possible scenarios is unlimited, and planning for all of them would take
unlimited time. To make this task manageable, plan for at least three but no more than five
scenarios to be used as the basis for the actual recovery. In the event of a disaster, you
would adapt the closest scenario(s) to the actual disaster. Therefore, a wide range of
scenarios is needed.
To create your scenarios:
1. Use the Three Common Disaster Scenarios section below as a base.
2. Prepare three to five scenarios applicable to your company that cover a wide range of
possibilities.
3. Create a high level plan (of major tasks) for each scenario.
4. Test the planned scenario by creating different test disaster scenarios and determining if
(and how) your scenario plans would be adapted to an actual disaster.
5. If the test scenario cannot be handled, then it should be planned for by adding to or
changing one of your scenarios.
Once this process is complete, base your detailed planning on these high-level plans.
Three Common Disaster Scenarios
The following three examples are ordered from the best case to the worst case.
The sample downtimes are only examples showing the difference in downtimes between
scenarios. They are only real for the one specific environment that they were derived from.
Your downtimes will be different. You must replace the sample downtimes with the
downtimes applicable to your own environment.
A Corrupt Database
< During this disaster event, the database is corrupted.
< Examples:
Accidentally loading test data into the production system.
A bad transport into production, resulting in the failure of the production system.
< Such a disaster requires the recovery of R/3 database and related operating system files.
< The sample downtime is eight hours.
A Hardware Failure
< During this disaster event, there is a major hardware failure.
< Examples:
Failure of a system processor
Failure of a drive controller
Failure of multiple drives in a drive array, so that the drive array fails
< Such a disaster scenario requires:
Replacement of failed hardware
Rebuilding the server (operating system and all programs)
Recovering the R/3 database and related files
< The sample downtime is seven days, and consists of:
Five days to procure replacement hardware
Two days to rebuild the NT server (one person); 16 hours of actual work time
A Complete Loss or Destruction of the Server Facility
< During this disaster event, the following items are lost:
Servers
All supporting infrastructure
All documentation and materials in the building (and possibly the building)
< Examples:
Fire, earthquake, flood, hurricane, or tornado
The World Trade Center bombing
< Such a disaster requires:
Replacement facilities
Replacement of infrastructure
Replacement of lost hardware
Rebuilding the server and R/3 environment (hardware, operating system, database,
etc.)
Recovering the R/3 database and related files
< The sample downtime is eight days, and consists of:
At least five days to procure hardware.
In a regional disaster, this could take longer if your suppliers were also affected by
the disaster.
Use national vendors with several regional distribution centers, or have an out-of-
area alternate supplier.
Two days to rebuild the NT server (one person); 16 hours actual work time
As the hardware is procured and the server is being rebuilt, an alternate facility is
obtained and an emergency (minimal) network is constructed
One day to integrate into the emergency network
< This scenario also requires a recovery back to a rebuilt facility, at a future time.
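The sample downtimes for the hardware failure and facility loss scenarios are sums of sequential steps, and such a total can be checked against a business requirement. The following is a minimal sketch. The step values are the sample figures from the two scenarios; the function names and the two-day limit used in the comparison (borrowed from Example 3 of the business requirements) are our own choices for illustration.

```python
# Sample downtime components, in days, copied from the two scenarios above.
# Replace them with values measured in your own recovery tests.
SCENARIOS = {
    "hardware failure": {"procure hardware": 5, "rebuild server": 2},
    "facility loss": {
        "procure hardware": 5,
        "rebuild server": 2,
        "integrate into emergency network": 1,
    },
}


def total_downtime_days(steps: dict[str, float]) -> float:
    """Sum the sequential steps of one scenario."""
    return sum(steps.values())


def exceeds_requirement(steps: dict[str, float], max_days: float) -> bool:
    """Return True when the estimated downtime is longer than the business allows."""
    return total_downtime_days(steps) > max_days


for name, steps in SCENARIOS.items():
    print(f"{name}: {total_downtime_days(steps)} days")

# Illustrative comparison against a two-day tolerance.
print(exceeds_requirement(SCENARIOS["facility loss"], max_days=2))
```

Obtaining the alternate facility is not added to the total, because it happens while the hardware is procured and the server is rebuilt. A result of True in the last line is the kind of mismatch that must be communicated to management for resolution.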
Recovery Script
What
A recovery script is a document that provides step-by-step instructions about:
< The process required to recover R/3
< Who will complete each step
< The expected time for steps that take a long time to run
< Dependencies between steps
Why
A script is necessary because it helps you:
< Develop and use a proven series of steps to restore R/3
< Prevent missing steps
Missing a critical step may require restarting the recovery process from the beginning,
which will delay the recovery.
A recovery script also helps the backup person do the recovery, if the primary recovery
person is not available.
Creating a Recovery Script
Creating a recovery script requires:
< A checklist for each step
And where necessary to clarify:
< A document with screenshots to clarify the instructions
< Flowcharts, if the flow of steps or activities is critical or confusing
Recovery Process
To reduce recovery time, define a process by:
< Completing as many tasks as possible in parallel
< Adding timetables for each step
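A recovery script that records the owner, expected time, and dependencies of each step can also produce a timetable. The following is a minimal sketch, assuming that steps without a dependency on each other can run in parallel and that enough staff is available to do so. The class Step, the function finish_times, and all step names, owners, and durations are hypothetical (only the 16 hours for the server rebuild echoes the sample scenarios).

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    owner: str
    hours: float
    depends_on: list[str] = field(default_factory=list)


def finish_times(steps: list[Step]) -> dict[str, float]:
    """Return the earliest finish time (in hours) of every step.

    A step starts as soon as all steps it depends on have finished.
    """
    by_name = {s.name: s for s in steps}
    done: dict[str, float] = {}

    def finish(name: str, seen: tuple[str, ...] = ()) -> float:
        if name in seen:
            raise ValueError(f"circular dependency at {name!r}")
        if name not in done:
            step = by_name[name]
            start = max(
                (finish(d, seen + (name,)) for d in step.depends_on),
                default=0.0,
            )
            done[name] = start + step.hours
        return done[name]

    for s in steps:
        finish(s.name)
    return done


# Hypothetical script, for illustration only.
script = [
    Step("recall offsite tapes", "operator", 4),
    Step("rebuild server", "system administrator", 16),
    Step("restore database", "database administrator", 6,
         ["recall offsite tapes", "rebuild server"]),
    Step("user sign-off test", "key users", 3, ["restore database"]),
]
timetable = finish_times(script)
for name, hours in timetable.items():
    print(f"{name}: done after {hours} hours")
```

In this example, recalling the tapes while the server is rebuilt shortens the recovery from 29 hours (all steps in sequence) to 25 hours.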
Major Steps
1. During a potential disaster, anticipate a recovery by:
< Collecting facts
< Recalling the latest offsite tapes
< Recalling the crash kit
< Calling all required personnel (including the internal SAP team, affected key
users, infrastructure support, IT, facilities, on-call consultants, and so on)
< Having functional organizations (sales, finance, and shipping) prepare alternate
manual procedures for key business transactions and processes.
2. Minimize the effect of the problem by:
< Stopping all additional transactions into the system
Waiting too long could make the problem even worse
< Collecting transaction records that have to be manually reentered
3. Begin the planning process by:
< Analyzing the problem
< Fitting the disaster to your predefined scenario plans
< Modifying the plans as needed for the disaster
4. Define when to initiate a disaster recovery procedure:
< What are the criteria to declare a disaster, and have they been met?
< Who will make the final decision to declare a disaster?
5. Declare the disaster.
6. Perform the recovery of the system.
7. Test and sign off on the recovered system.
Key users, who will use a criteria checklist to determine that the system has been
satisfactorily recovered, should perform the testing.
8. Catch up with financial transactions that may have been handled by alternate processes
during the disaster.
Once completed, this step should require an additional sign-off.
9. Notify the users that the system is ready for normal operations.
10. Conduct a postmortem debriefing.
Use the results from this postmortem to improve your disaster recovery planning.
Crash Kit
What
A crash kit is a container that contains everything needed to:
< Rebuild the R/3 servers
< Reinstall R/3
< Recover the R/3 database and related files
Why
In the event of a disaster, everything that is needed to recover the R/3 environment is
contained in one (or a few) containers.
If you have to evacuate the site, you do not have time to run around gathering all the
items at the last minute, and hope that you got everything you need. In a major disaster you
may not even have that opportunity.
When
When a change is made to any component (hardware or software) on the server, replace the
outdated items in the crash kit with updated items that have been tested.
Where to Put the Crash Kit
It should be physically separated from the servers.
If the crash kit is located in the server room and the server room is destroyed, you have lost
your crash kit. Possible locations include:
< Commercial offsite data storage
< Other company sites
< In another secure section of the building
How
The following is a checklist of the items to put into the crash kit. You may need to add or
delete items for your specific environment.
Documentation
< An inventory (list) of what is in the crash kit, signed and dated by the person
checking the crash kit.
The crash kit should be sealed by the person checking the kit and taking the inventory of
its contents. If the seal is broken, someone may have removed or changed items from the
crash kit, and possibly made it useless in a recovery.
< Disaster recovery script
< Installation instructions for the:
Operating system
Database
R/3 System
Others
< Special installation instructions for:
Drivers that have to be manually installed
Programs that must be installed in a specific manner
< Copies of the SAP license for all instances
< Copies of service agreements (with phone numbers) for all servers
Find out if maintenance agreements are still valid. Have any agreements expired?
< Instructions to recall tapes from offsite data storage.
< The list of personnel who are authorized to recall tapes from offsite data storage.
The list of personnel must correspond to the list maintained by the data storage
company.
< A parts list
If the server is destroyed, this list should be in sufficient detail to purchase or lease
replacement hardware.
Over time, original parts may no longer be available. An alternate parts list will have to
be prepared when this happens. At this point, you might consider upgrading the
equipment.
< File system layout
< Hardware layout, you need to know which:
Cards go in which slots
Cables go where (connector-by-connector)
Labeling cables and connectors greatly reduces confusion
< Phone numbers for:
Key users
Information services personnel
Facilities personnel
Other infrastructure personnel
Consultants
SAP Hotline
Offsite data storage
Security department or personnel
Service agreement contacts
Hardware vendors
Software
< Operating System:
Installation kit
Drivers for hardware, such as a Network Interface Card (NIC) or a SCSI
controller, that are not included in the installation kit
Service packs, updates, and patches
< Database:
Installation kit
Service packs, updates, and patches
Recovery scripts, to automate the database recovery
< For R/3:
Installation kit
Currently installed kernel
System profile files
tpparam file
saprouttab file
< Other R/3 integrated programs (for example, a tax package)
< Other software for the R/3 installation:
Utilities
Backup
UPS control program
Hardware monitor
FTP client
Remote control program
System monitor
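The person who checks the crash kit can compare the packed items against the checklist before sealing the kit. The following is a minimal sketch of such a comparison. The item names are illustrative labels abbreviated from the checklist above, and the function name missing_items is our own; extend the list for your environment.

```python
# Abbreviated from the checklist above; the item names are illustrative labels.
REQUIRED_ITEMS = {
    "signed inventory",
    "disaster recovery script",
    "installation instructions",
    "SAP license copies",
    "service agreements",
    "parts list",
    "operating system installation kit",
    "database installation kit",
    "R/3 installation kit",
}


def missing_items(packed: list[str]) -> list[str]:
    """Return the required items that are not in the packed list, sorted by name."""
    return sorted(REQUIRED_ITEMS - set(packed))


# Hypothetical state of a kit that is only partly packed.
packed = ["signed inventory", "disaster recovery script", "parts list"]
for item in missing_items(packed):
    print("MISSING:", item)
```

An empty result means that every listed item is present. It does not show that the items are current, so the items still have to be replaced and tested after each change to the server.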
Business Continuation During Recovery
What will your company do while the R/3 System is being recovered?
Business continuation is an alternate process to continue doing business while recovering
from a disaster. It includes:
< Cash collection
< Order processing
< Product shipping
< Bill paying
< Payroll processing
< Alternate location to continue business
Why
Without an alternate process, your company could stop doing business.
For example:
< Orders cannot be entered
< Product cannot be shipped
< Money cannot be collected
How
There are many alternate processes, including:
< Manual paper-based
< Standalone PC-based products
Offsite Disaster Recovery Sites
< Other company sites
< Commercial disaster recovery sites
< Share or rent space from other companies
Integration with your Company's General Disaster Planning
Because there are many interdependencies, the R/3 disaster recovery process must be
integrated into your company's general disaster planning. This process includes telephone,
network, product deliveries, mail, and so on.
When the R/3 System Returns
How will the transactions that were handled with the alternate process be entered into R/3
when R/3 is back in production?
Test your Disaster Recovery Procedure
Do you know if you can actually recover the system?
Unless you test your recovery process, you do not know if you can actually recover
your system.
A test is a simulated disaster recovery done to verify that you can recover the system and
exercise every task outlined in the disaster recovery plan.
< Test to find out if:
Your disaster recovery procedure works
Something changed and was not documented or updated
There are steps that need clarification for others
What is clear to the person documenting the steps may be unclear to another
person reading the documentation.
Older hardware is no longer available
If so, alternate planning needs to be done and tested. You may have to upgrade your
hardware to be compatible with currently available equipment.
Since many factors affect recovery time, actual recovery times can only be determined by
testing. Once you have actual times (not guesses or estimates), your disaster planning
becomes more credible.
If the procedure is practiced often, when a disaster occurs, everyone will know what to do.
This way, the chaos of a disaster will be reduced.
How
< Execute your disaster recovery plan on a backup system or at an offsite location.
< Generate a random disaster scenario
< Execute your disaster plan to see if it handles the scenario.
When
< A full disaster recovery should be practiced at least once a year.
Where
< The disaster recovery test should be done at the same site that you expect to recover.
If you have multiple recovery sites, perform a test recovery at each site. The
equipment, facilities, and configuration may be different from site to site. Document
all specific items that need to be completed for each site. If you do not test at a
particular site, you will not know if recovery is possible at that site. You do not want
to discover that you cannot recover at a site after a disaster occurs.
< A backup onsite server
< Another company site
< At another company where you have a mutual support agreement
< A company that provides disaster recovery site and services
Who Should Participate
< Primary and backup personnel who will do the job during a real disaster recovery
A provision should be made that some of the primary personnel will be unavailable
during a disaster recovery. A test procedure might be to pull a name from a hat, and
declare that person unavailable to participate. This duplicates a real situation in which
a primary person is seriously injured or killed.
< Personnel at other sites
Integrate these people into the test, since they may be needed to perform the
recovery during a real disaster. These people will fill in for unavailable personnel.
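Generating a random disaster scenario and pulling a name from a hat can both be done with a small script. The following is a minimal sketch. The scenario names follow the three common disaster scenarios; the staff roles and the function name draw_test are hypothetical.

```python
import random

SCENARIOS = ["corrupt database", "hardware failure", "loss of server facility"]
# Hypothetical roles; use the names from your own recovery group.
STAFF = ["primary DBA", "backup DBA", "system administrator", "network administrator"]


def draw_test(scenarios, staff, rng=None):
    """Pick one scenario to simulate and one person to declare unavailable."""
    rng = rng or random.Random()
    return rng.choice(scenarios), rng.choice(staff)


scenario, unavailable = draw_test(SCENARIOS, STAFF)
print(f"Simulate: {scenario}; declared unavailable: {unavailable}")
```

Passing a seeded generator, such as random.Random(42), makes a draw repeatable, which may help when you document a test afterward.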
Other Considerations
Other Up- or Downstream Applications
Other up- or downstream applications also need to be recovered with R/3 for the company
to function. Some of these may be very tightly associated with R/3. These applications
should be accounted for and protected in the company-wide disaster recovery planning.
Be careful of applications that are located only on one person's desktop computer.
Backup Sites
Having a contract with a disaster recovery site does not guarantee that the site will be
available. In a regional disaster, such as an earthquake or flood, many other companies will
be competing for the same commercial disaster sites. In this situation, you may not have a
site to recover to, if others have booked it before you.
The emergency backup site may not have equipment of the same performance level as your
production system. Reduced performance and transaction throughput must be considered.
Examples:
< A reduced batch schedule of only critical jobs.
< Only essential business tasks will be done while on the recovery system.
Minimizing the Chances for a Disaster
There are many ways to minimize the chances for a disaster. Some of these ideas are quite
obvious, but it is these obvious ones that can be forgotten.
Minimize Human Error
Many disasters are caused by human error, such as a mistake or a tired operator.
< Potentially damaging tasks should be scripted, and checkpoints should be included
to verify what is being done before doing dangerous tasks such as:
Deleting the test database
Check that the delete command specifies the Test database, not the
Production database.
Moving a file
Verify that the target file (to be overwritten) is the old file, not the new file.
Formatting a new drive
Verify that the new drive will be formatted, not an existing drive with data on it.
< When you are tired, get a second opinion before doing something potentially
dangerous.
Do not attempt dangerous tasks when you are tired.
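A scripted checkpoint for a dangerous task could look like the following. This is a minimal sketch, not a procedure for any particular database: the names PRD and TST, the function checked_delete, and the delete callback are all hypothetical. The sketch refuses to act on a protected name and on any target that differs from the one the operator expected.

```python
PROTECTED = {"PRD"}  # hypothetical name of the production database


def checked_delete(target: str, expected: str, delete) -> None:
    """Call delete(target) only if both checkpoints pass."""
    # Checkpoint 1: never touch a protected (production) database.
    if target.upper() in PROTECTED:
        raise PermissionError(f"{target!r} is protected; refusing to delete")
    # Checkpoint 2: the target must be the one the operator meant.
    if target != expected:
        raise ValueError(f"target {target!r} does not match expected {expected!r}")
    delete(target)


deleted = []  # stands in for the real delete command in this sketch
checked_delete("TST", "TST", deleted.append)  # passes both checkpoints
try:
    checked_delete("PRD", "TST", deleted.append)
except PermissionError as err:
    print("Blocked:", err)
print(deleted)
```

The same pattern applies to moving a file or formatting a drive: state the expected target explicitly, and stop if the actual target is different.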
Minimize Single Points of Failure
A single-point failure is when the failure of a single component causes the entire system to
fail or crash.
To minimize single-point failures:
< Identify conditions where a single-point failure can occur
< Anticipate what will happen if this component or process fails
< Eliminate as many of these single points of failure as practical.
Practical is defined by the amount of work or cost compared to the risk of failure.
Examples:
< The backup R/3 server is located in the same data center as the production R/3 server.
If the data center is destroyed, the backup server is also destroyed.
< All the R/3 servers are on a single electrical circuit.
If the circuit breaker opens, everything on that circuit loses power, and all the servers
will crash.
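One way to identify such conditions is to list what each server depends on and look for resources that all servers share. The following is a minimal sketch with a hypothetical inventory; the resource names and the function shared_dependencies are our own.

```python
# Hypothetical inventory: each server mapped to the resources it depends on.
DEPENDENCIES = {
    "production server": {"data center A", "circuit 1", "UPS 1"},
    "backup server": {"data center A", "circuit 1", "UPS 2"},
}


def shared_dependencies(deps: dict[str, set[str]]) -> set[str]:
    """Return the resources that every listed server depends on."""
    if not deps:
        return set()
    return set.intersection(*deps.values())


print(sorted(shared_dependencies(DEPENDENCIES)))
# prints ['circuit 1', 'data center A']
```

Every resource in the result is a candidate single point of failure. Whether it is practical to eliminate depends on the work or cost compared to the risk of failure.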
Cascade Failures
A cascade failure occurs when one failure triggers additional failures, which increases the
complexity of a problem. The recovery involves the coordinated fixing of many problems.
For example, note the following progression of events:
1. A power failure in the air conditioning system causes an environmental (air
conditioning) failure in the server room.
2. Without cooling, the temperature in the server room rises above the equipment's
acceptable operating temperature.
3. The overheating causes a hardware failure in the server.
4. The hardware failure causes a database corruption.
In addition, overheating can damage many things, so you may not know what else was
damaged, such as:
Network equipment
Phone system
Other servers
The recovery becomes complex because:
< Fixing one problem may uncover other problems or damaged equipment.
< Dependencies exist, where certain things cannot be tested or fixed until other equipment is
operational.
In this case, a system that monitors the air conditioning system and/or the temperature in
the server room could alert the appropriate people before the temperature in the server
room becomes too hot.
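The core of such a monitor is a threshold check. The following is a minimal sketch, assuming that temperature readings are already available as numbers. The thresholds of 27 and 32 degrees Celsius are placeholders, not vendor figures, so take the real limits from your equipment documentation. The function names are our own.

```python
def temperature_status(reading_c: float, warn_c: float, limit_c: float) -> str:
    """Classify a server-room temperature reading against two thresholds."""
    if warn_c >= limit_c:
        raise ValueError("warning threshold must be below the limit")
    if reading_c >= limit_c:
        return "critical"
    if reading_c >= warn_c:
        return "warning"
    return "ok"


def alerts(readings_c, warn_c=27.0, limit_c=32.0):
    """Yield (index, status) for every reading that is not 'ok'."""
    for i, reading in enumerate(readings_c):
        status = temperature_status(reading, warn_c, limit_c)
        if status != "ok":
            yield i, status


# Hypothetical readings in degrees Celsius after the air conditioning fails.
print(list(alerts([22.0, 24.5, 28.0, 33.5])))
# prints [(2, 'warning'), (3, 'critical')]
```

The warning level is what matters here: it gives the appropriate people time to react before the hardware failure in step 3 of the progression occurs.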
