IT Contingency Plan Model for DR Strategy Selection

An Empirical IT Contingency Planning Model for Disaster Recovery Strategy Selection
Montri Wiboonrat, Graduate School of Information Technology, Assumption University, Bangkok, Thailand mwiboonrat@gmail.com, montri.w@tss-design.com
Abstract: In today's banking industry, 24x365 service availability is of utmost importance to gaining competitive advantage. Proper IT contingency planning (ITCP) for disaster recovery (DR) ensures business continuity and optimizes investment. This research investigates the fundamental requirements of each banking business unit in order to map the criticality of business continuity to DR readiness. The process assesses the recovery time objective (RTO) and the recovery point objective (RPO) to assure business continuity within the maximum tolerable period of disruption (MTPD). The ITCP model serves as a decision-support standard and rationale for choosing the most appropriate disaster recovery solution for differing business unit requirements.
Index Terms: Information Technology Contingency Planning (ITCP), Disaster Recovery (DR), Business Impact Analysis (BIA), Business Continuity Plans (BCP).
I. INTRODUCTION
Since banking operations rely greatly on critical transaction data, continuous availability of information and fast recovery from system failures may spell the difference between success and failure in the financial sector. Uninterrupted service of IT system constituents such as networks, servers, storage, and integrated systems is crucial. Zero downtime is the ultimate goal, although it is practically unachievable. At times, an IT system is subject to planned or unplanned service disruptions. IT contingency planning, business continuity planning, and disaster recovery planning are required to ensure proper handling of a disaster and to promptly resume normal operations [4]. Some recommended disaster recovery measures include [11]:
- Worst-case scenario planning for a disaster
- Initiating strategies for recovering critical business data or processes
- Implementing technologies to support the recovery of automated functions and systems
- Training the operators involved on operational and contingency processes for handling unexpected incidents
Helms [7] stated that preventive measures are more important than recovery measures. Pre-planned procedures for system recovery represent a significant part of IT contingency planning, particularly for companies whose critical business functions rely mostly on data communication. Proper IT contingency planning is thus key to optimizing operations and investment. This paper aims at developing an IT contingency planning model for disaster recovery. Exploratory research is conducted by collecting information, conducting key factor analysis and critical level analysis, formulating recovery strategies, mapping the various critical levels to Tier solutions, and eventually recommending appropriate actions.
Definitions
- Business Continuity Planning (BCP): Assessing the consequences and impacts of expected or unexpected downtime due to disasters as well as hardware or software failures [8].
- Disaster: Natural, technological, and human-induced events that disrupt the normal functioning of the system on a large scale [10].
- Disaster recovery: A set of activities executed once the disaster occurs, including the use of backup facilities to provide users of IT systems with access to the data and functions required to sustain business processes.
- Recovery Point Objective (RPO): The point in time from which data must be restored in order to resume processing transactions.
- Recovery Time Objective (RTO): The period of time allowed for recovery, i.e. the time that can elapse between the disaster and the activation of the secondary site.
- Maximum Tolerable Period of Disruption (MTPD): The maximum acceptable downtime that still guarantees business continuity.
- Data Availability: A system process that ensures minimum data loss (L). It requires that all active, standby, and parallel sites in a corporation have copies of critical data. This can be achieved by replicating data between the primary and secondary sites. The original data must be reproduced within the time required to meet the business MTPD [9].
II. BACKGROUND
Commercial banks rely heavily on online, real-time, and uninterrupted financial transaction data. Once a disaster arises, the security, reliability, availability, and accuracy of information can be maintained by promptly resuming normal operations, whatever the level of the disaster [1]. The business continuity plans ensure availability of critical financial services and data by providing appropriate guidance and examination procedures for monitoring processes and performing risk management [5].
978-1-4244-2289-0/08/$25.00 2008 IEEE
BS 25999-2, Business Continuity Management - Part 2: Specification for business continuity management (BCM), specifies requirements for setting up and managing an effective business continuity management system (BCMS). The BCMS outlines the business continuity management programs, which emphasize the importance of [2]:
a) Understanding business continuity needs and the necessity for establishing policy and objectives for business continuity;
b) Implementing and operating controls for managing an organization's overall business continuity risks;
c) Monitoring and reviewing the performance and effectiveness of the BCMS;
d) Continual improvement based on objective measurement.
A management system consists of:
a) People with defined responsibilities;
b) Management processes relating to: 1) Policy, 2) Planning, 3) Implementation and operation, 4) Performance assessment, 5) Improvement, 6) Management review;
c) A set of documentation providing auditable evidence;
d) Topic-specific processes, e.g. business impact analysis (BIA) and business continuity plan development.
However, business continuity management represents insurance or investment for handling mishaps, which may or may not happen to a business. Awareness of BCM remains low until disasters such as September 11 or the tsunami strike. The investment does not provide an immediate return, and the expenditure is very difficult to justify [11].
III. IT CONTINGENCY PLANNING METHODOLOGY
A. Exploring the IT Contingency Concepts
To explore the concepts of IT contingency planning (ITCP), this research conducted a survey at a top-ranked commercial bank in Thailand. Data were collected in 2 stages:
First: The researcher conducted interviews with 60 employees, including 10 top executives, 15 IT team members, 10 application development members, 10 data entry staff, and 15 end users.
Second: The researcher developed a multiple-choice questionnaire, based on findings from the interviews in the first stage. One hundred questionnaires were sent via email to qualified respondents who are directly involved in ITCP and BCP. A 100% response rate was required for the questionnaire.
The questionnaire and interviews assessed the key factors that influence the selection of a solution to meet business objectives and requirements in terms of performance, capacity, capability, investment, and implementation time. Four impact factors are evaluated under the BIA: Finance, Reputation, Operation, and Regulation.
TABLE I SUBJECT AREAS OF DISASTER RECOVERY PLANNING ACTIVITIES
B. The IT Contingency Planning Conceptual Framework
The ITCP conceptual standard is a roadmap and guideline for creating a standardized banking procedure for disaster recovery plans (DRP). ITCP is divided into 4 stages, as shown in Fig. 1:
Stage I: Gathering and collecting information from the 2 original sources, processed through the 4 factors.
Stage II: Mapping the 4 factors to critical levels by considering them together with internal and external information.
Stage III: Mapping critical levels to a disaster recovery strategy based on currently available technologies.
Stage IV: Giving recommendations supported by decision-making information.
Fig. 1. ITCP Procedural Standard for Disaster Recovery Plans
The DRP activity processes appear in the ITCP procedural standard for DRP, as illustrated in Fig. 1. The DRP process, adapted from [3], covers all activities performed during disaster recovery planning. The DRP is classified into 10 subject areas, as seen in Table I.
Subject areas of disaster recovery planning activities

| Acronym | Subject Area | Scope |
|---|---|---|
| PIM | Project Initiation and Management | Initiating, organizing, and coordinating activities within other subject areas and coordinating disaster recovery plans with the overall risk management strategy. |
| REC | Risk Evaluation and Control | Determining possible causes of a disaster and potential damage, and assessing organizational and technical controlling mechanisms that help to prevent the damage. |
| BIA | Business Impact Analysis | Assessing the impact of a disaster in terms of business activities, discovering adequate time frames, and quantifying the risk. |
| BCMS | Developing Business Continuity Management Strategies | Selecting operating strategies that allow the identified time requirements to be met. |
| ERO | Emergency Response and Operations | Establishing procedures for initiating and managing the process of recovery after a disaster. |
| BCCM | Developing and Implementing Business Continuity and Crisis Management Plans | Preparing detailed recovery plans. |
| AT | Awareness and Training Programs | Informing and training staff to facilitate the execution of disaster recovery plans and procedures. |
| MEP | Maintaining and Exercising Plans | Updating plans to account for organizational changes and organizing practical exercises. |
| CC | Crisis Communications | Planning activities providing for coordination with external and internal stakeholders. |
| EA | Coordination with External Agencies | Planning for coordination with government agencies and achieving compliance with external regulations. |
C. Mapping Four Factors to Critical Levels
The research scores each of the 4 factors as 1, 2, 3, or 4. A score of 4 means the factor's impact is the highest, and a score of 1 means it is the lowest. The results of the direct interviews (60) and questionnaires (100) not only show the process of mapping the 4 factors to the critical level of each disaster recovery solution, reflecting each business unit's requirements, but also depict the prioritization of the bank's existing disaster recovery solutions, as shown in Table II. From the direct interviews and questionnaires, the research arrives at 4 factors, Finance, Reputation, Operation, and Regulation, as the key indicators for measuring the critical level of disaster recovery solutions. The critical level is the maximum score among the 4 factors. For example, with Finance (3), Reputation (2), Operation (1), and Regulation (4), the critical level is 4, as shown in Table II.
D. Time Parameters
A highly reliable system comes at a high cost. The same holds for disaster recovery technology: the more a system reduces RPO and RTO to limit data losses (L), the higher the investment it requires. In that case Tier#7, symmetrical recovery data facility (SRDF), needs to be considered, as shown in Fig. 2. If the RPO and RTO are close to zero, the recovery strategy must be automatic failover or load balancing to minimize data loss (Tier#7). If a banking business unit has more tolerance for service downtime, meaning that it can accept a long time to recover the system and some data loss, Tier#1 applies. Point in time (PiT) recovery is considered a solution for making the most of the investment while meeting business objectives, as illustrated in Fig. 3 and Table III. The relationship between the available tier level and data losses is depicted in Fig. 3 on a time scale for recovering systems and data. There are 3 factors relevant to this model: Availability Tier of Disaster Recovery, Data Losses, and Recovery Times. Tier#7 is the highest level, with automatic system failover or SRDF [13]: interruption to service is zero, so there is no recovery time, and data losses (L) are close to zero as well.
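The time parameters can be made concrete with a small check. The sketch below is illustrative only: the function name, the timestamps, and the target values are hypothetical, and comparing the achieved recovery time against the MTPD as well as the RTO is an assumption drawn from the definitions in Section I rather than a procedure given in this paper.

```python
from datetime import datetime, timedelta


def meets_objectives(last_copy, disaster, restored, rpo, rto, mtpd):
    """Compare one incident timeline with RPO, RTO, and MTPD targets.

    All parameter names and example values are hypothetical.
    """
    # Work performed since the last usable copy is the data at risk.
    data_loss_window = disaster - last_copy
    # Downtime runs from the disaster until service is restored.
    recovery_time = restored - disaster
    return {
        "rpo_met": data_loss_window <= rpo,
        "rto_met": recovery_time <= rto,
        "mtpd_met": recovery_time <= mtpd,
    }


result = meets_objectives(
    last_copy=datetime(2008, 1, 1, 8, 0),
    disaster=datetime(2008, 1, 1, 9, 30),
    restored=datetime(2008, 1, 1, 13, 30),
    rpo=timedelta(hours=2),
    rto=timedelta(hours=3),
    mtpd=timedelta(hours=6),
)
print(result)
```

In this made-up timeline the data loss window (1.5 hours) is inside the RPO and the downtime (4 hours) is inside the MTPD, but the RTO of 3 hours is missed, which is the kind of gap that would push a business unit toward a higher tier.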
TABLE II CRITICAL LEVELS FOR DISASTER RECOVERY SOLUTIONS
| Financial Impact | Reputation | Operations | Regulations | Critical Level | Priority |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 0.25 |
| 1 | 1 | 0 | 0 | 1 | 0.5 |
| 1 | 1 | 1 | 0 | 1 | 0.75 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 2 | 1.25 |
| 2 | 2 | 1 | 1 | 2 | 1.5 |
| 2 | 2 | 2 | 1 | 2 | 1.75 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 1 | 1 | 1 | 3 | 1.5 |
| 3 | 2 | 1 | 1 | 3 | 1.75 |
| 3 | 2 | 2 | 1 | 3 | 2 |
| 3 | 2 | 2 | 2 | 3 | 2.25 |
| 3 | 3 | 2 | 2 | 3 | 2.5 |
| 3 | 3 | 3 | 2 | 3 | 2.75 |
| 3 | 3 | 3 | 3 | 3 | 3 |
| 4 | 1 | 1 | 1 | 4 | 1.75 |
| 4 | 2 | 1 | 1 | 4 | 2 |
| 4 | 2 | 2 | 1 | 4 | 2.25 |
| 4 | 3 | 2 | 1 | 4 | 2.5 |
| 4 | 3 | 3 | 2 | 4 | 3 |
| 4 | 4 | 3 | 2 | 4 | 3.25 |
| 4 | 4 | 4 | 3 | 4 | 3.75 |
| 4 | 4 | 4 | 4 | 4 | 4 |

| Critical Range (CR) | Priority value: DR Solution |
|---|---|
| CR D | 0.25: D; 0.50: DD; 0.75: DDD; 1.00: D-DDD |
| CR C | 1.25: C; 1.50: CC; 1.75: CCC; 2.00: B |
| CR B | 1.50: B; 1.75: B; 2.00: BB; 2.25: BB; 2.50: BBB; 2.75: BBB; 3.00: A |
| CR A | 1.75: BBB; 2.00: A; 2.25: A; 2.50: A; 2.75: AA; 3.00: AA; 3.25: AA; 3.50: AAA; 3.75: AAA; 4.00: AAA |
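The scoring rule of Section C can be sketched in a few lines. The critical level as the maximum of the four scores comes from the text. Treating the Critical Level Priority column of Table II as the mean of the four scores is an inference from the table's numbers (for example 4, 3, 2, 1 gives 2.5), not a rule stated in the text, and the function name is hypothetical.

```python
def critical_level(finance, reputation, operation, regulation):
    """Return (critical level, priority) for one set of factor scores.

    Critical level: the highest of the four scores (from the text).
    Priority: the mean of the four scores (assumed from Table II).
    """
    scores = (finance, reputation, operation, regulation)
    level = max(scores)
    priority = sum(scores) / len(scores)
    return level, priority


# The example from the text: Finance 3, Reputation 2, Operation 1, Regulation 4.
example = critical_level(3, 2, 1, 4)
print(example)  # (4, 2.5)
```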
Fig. 3. Correlation on Tier Levels and Data Losses (L)
E. Mapping Critical Level to Disaster Recovery Strategy
In banking, different business units have different levels of tolerance to service downtime. System recovery and data restoration for a business unit depend on its maximum tolerable period of disruption (MTPD), as defined in [6] [14]. The relationship between critical level and recovery strategy needs to be identified and formed into a standardized pattern. The research yields the critical levels AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, and D, which map to the available tiers of disaster recovery strategy, as illustrated in Table III.
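As a rough illustration, the mapping from critical level to tier in Table III, together with the site class this paper associates with each tier (cold, warm, hot, fault tolerance), can be written as two lookups. The dictionary and function names are hypothetical, and the ranges such as DD-DDD are expanded into individual ratings as an assumption about how the table is meant to be read.

```python
# Critical level -> DR tier, following the ranges listed in Table III.
TIER_BY_RATING = {
    "D": 1,
    "DD": 2, "DDD": 2,
    "C": 3, "CC": 3,
    "CCC": 4, "B": 4,
    "BB": 5, "BBB": 5,
    "A": 6, "AA": 6,
    "AAA": 7,
}

# DR tier -> site class, as described for Tier#1 through Tier#7.
SITE_BY_TIER = {
    1: "cold site", 2: "cold site",
    3: "warm site", 4: "warm site",
    5: "hot site", 6: "hot site",
    7: "fault tolerance",
}


def recovery_strategy(rating):
    """Return (tier, site class) for a critical level such as 'BBB'."""
    tier = TIER_BY_RATING[rating]
    return tier, SITE_BY_TIER[tier]


print(recovery_strategy("BBB"))  # (5, 'hot site')
```

A plain dictionary keeps the mapping auditable: each entry can be checked against one row of the table, and an unknown rating fails loudly with a KeyError instead of silently defaulting to a tier.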
Fig. 2. RPO and RTO Time Parameters
TABLE III MAPPING CRITICAL LEVELS TO AVAILABLE TIERS OF RECOVERY STRATEGY

| Critical Levels | DR Tier # | Solutions | Description of Tier | RTO | RPO |
|---|---|---|---|---|---|
| D | 1 | 1 | Point in Times | 2-7 days | 2-24 hrs |
| DD-DDD | 2 | 1 | Tape to Provisional Backup Site | 1-3 days | 2-24 hrs |
| C-CC | 3 | 2 | Disk PiT Copy, Multi-Hop | | |
| CCC-B | 4 | 2,3 | Remote Logging | | |
| BBB-BB | 5 | 3 | Concurrent ReEx (RRDF, E-Net, others) | | |
| AA-A | 6 | 4 | Remote Copy | | |
| AAA | 7 | 4 | Remote Copy with Failover | | |

[RTO and RPO values for Tiers 3-7, in extracted order, assignment to tiers not recoverable: 2-24 hrs; 2-24 hrs; 12-24 hrs; 5-30 mins; 1-12 hrs; 1-4 hrs; 5-10 mins; 0-5 mins; 0-60 mins; 0-5 mins. Technology columns marked with X per tier, marks not recoverable: Tape Backup, Real-time Disk, Remote Logging, Available System, Active System.]

Elrod [4] stated that disaster recovery plans are concerned with reconstructing and retrieving information if a primary production facility has been damaged or destroyed. The four solutions for disaster recovery strategy are cold sites (Tier#1 and Tier#2), warm sites (Tier#3 and Tier#4), hot sites (Tier#5 and Tier#6), and fault tolerance (Tier#7) [12].
Cold sites, Tier#1 and Tier#2, are considered for business units that require minimum investment in recovery strategy solutions, which in turn implies that the business unit has a high downtime tolerance. A cold site has only basic infrastructure support, i.e. CRAC, power, cabling, and communication, but it does not have any servers or network equipment. The required IT equipment is provided under an SLA or point in time (PiT) arrangement, depending on how fast the business needs to resume services.
Warm sites, Tier#3 and Tier#4, are designed for business units that can tolerate moderate service downtime. The system recovery design is able to resume business within a day. Service hardware is already prepared onsite, waiting for a signal that the main site is down before the systems are activated.
Hot sites, Tier#5 and Tier#6, are required for business units that need high service availability. The acceptable business downtime is only a few hours, so infrastructure and service equipment are already in standby mode. At the same time, an SLA is required to support 24x365 operation.
Fault tolerance, Tier#7, runs fully in parallel across all IT infrastructure and service equipment, with load balancing on communication links. Tier#7 is the highest recovery strategy, automatic failover, so RTO and RPO approach zero service downtime.
This research maps critical levels to disaster recovery patterns that refer to existing technology to form a standard pattern, as shown in Table IV.

TABLE IV MAPPING PATTERN MATRIX FOR DISASTER RECOVERY STRATEGY

| | No Hardware (vendor SLA) | Shared Hardware | Dedicated Hardware |
|---|---|---|---|
| No-Data Replicated | D, DD, DDD | C, CC | CCC, B |
| Data Replication (Server) | N/A | BB, BBB | BBB, A |
| Data Replication (SAN) | N/A | BBB, A | A, AA, AAA |

F. Selecting Disaster Recovery Solution Models
Once a bank has a disaster recovery strategy, the IT team needs to consider each candidate solution within the selected Tier. Under international standards for supplier and DR solution selection, at least 3 solutions should be considered. The selection process is concerned with 5 criteria indicators: Time to Implement, RTO, Investment, System Performance, and System Capacity. The conceptual criterion for selecting a disaster recovery solution is the result of multiplying all 5 criteria indicators. The solution with the smallest Total Solution figure (smaller is better) is recommended. In the example of Table V, solution A is selected because its Total Solution is the smallest when compared to solution B and solution C.
TABLE V EXAMPLE OF SELECTING DISASTER RECOVERY SOLUTIONS

| Axes | Value Range | Solution A | Solution B | Solution C |
|---|---|---|---|---|
| Time to Implement | months | 10 | 14 | 11 |
| RTO | Hours | 0.01 | 1 | 0.1 |
| Investment | Millions | 10 | 5 | 9 |
| 1/Performance | 0.1-1* | 0.11111111 | 0.25 | 0.1428571 |
| 1/Capacity | 0.1-1* | 0.1 | 0.1428571 | 0.1111111 |
| Total Solution | | 0.0011111 | 0.1785714 | 0.0142857 |
| Ranking (1 is the best) | | 1 | 3 | 2 |
* Performance and capacity: in this experiment, the solution with the best performance and capacity is assigned a value of 1, and the remaining solutions, with lower performance or capacity, are assigned a ratio of less than 1, e.g. 0.8, 0.5, or 0.2. The results for the 5 criteria in Table V can be plotted as a radar (spider) graph to make them easy to verify and compare on a common scale, as depicted in Fig. 4.
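The selection rule of Section F (multiply the criteria indicators and pick the smallest product) can be sketched as follows, using the figures from Table V. The variable and function names are hypothetical. One observation from the numbers: the Total Solution row of Table V matches the product of the four indicators other than Time to Implement, whereas the text describes multiplying all 5, so the sketch computes both variants. The ranking comes out the same either way.

```python
from math import prod

# Indicator values per solution, copied from Table V.
SOLUTIONS = {
    "A": {"time": 10, "rto": 0.01, "investment": 10,
          "inv_performance": 0.11111111, "inv_capacity": 0.1},
    "B": {"time": 14, "rto": 1, "investment": 5,
          "inv_performance": 0.25, "inv_capacity": 0.1428571},
    "C": {"time": 11, "rto": 0.1, "investment": 9,
          "inv_performance": 0.1428571, "inv_capacity": 0.1111111},
}

ALL_FIVE = ("time", "rto", "investment", "inv_performance", "inv_capacity")
WITHOUT_TIME = ALL_FIVE[1:]


def rank(solutions, criteria):
    """Return (ranking, totals); rank 1 is the smallest product."""
    totals = {
        name: prod(values[c] for c in criteria)
        for name, values in solutions.items()
    }
    ordered = sorted(totals, key=totals.get)
    ranking = {name: pos for pos, name in enumerate(ordered, start=1)}
    return ranking, totals


ranking_all, totals_all = rank(SOLUTIONS, ALL_FIVE)
ranking_four, totals_four = rank(SOLUTIONS, WITHOUT_TIME)
print(ranking_all, ranking_four)
```

Because every indicator is oriented so that smaller is better (performance and capacity enter as reciprocals), a plain product is enough; the drawback of this design is that the indicator with the widest numeric range, here RTO, dominates the result.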
Fig. 4. Plotting Results of Evaluation Model on Radar Graph
IV. DISCUSSION
The results of mapping the 4 factors to critical levels, and critical levels to disaster recovery strategy, are interpreted together with an application prioritization. This helps banks to predefine collective action plans for a disaster in terms of the sequence in which services need to be recovered, following application priority.
Tested on 50 banking applications, the research finds that 40% of the disaster recovery solutions for business unit applications are over-invested in terms of system reliability. Another 10% fall short of the required recovery system reliability when analyzed and evaluated through the ITCP model's roadmap from critical levels to disaster recovery strategies. The remaining 50% of applications are considered acceptable with respect to system reliability and optimal investment. The ITCP model helps the IT team and financial analysts compare how to upgrade or downgrade an application's recovery strategy to rebalance it, optimizing investment while raising the available Tier level where needed to meet business-critical requirements. After passing through the ITCP selection process, the IT recovery system needs to become more resilient and improve system survivability. Moreover, it must speed up the recovery of other infrastructure and data by providing the best information about the status of systems and advance warning of imminent failures.
V. CONCLUSION
The aim of the IT contingency planning model and disaster recovery plans is to minimize the financial and reputational losses caused by interruptions to banking services. The solutions integrate strategic approaches with multiple technologies, combined to achieve the specific requirements of each banking business unit. The critical level is identified and the disaster recovery solutions are implemented to optimize system performance and cost effectiveness. This research offers an IT contingency planning model that helps business applications meet the specific needs identified through business planning objectives and business impact analyses. Moreover, the ITCP model supports business continuity for the long-term health of the banking business. The model will be applied in a bank to support decision makers in selecting the proper disaster recovery solutions for their business unit objectives, based on the optimal investment to minimize business losses.
ACKNOWLEDGMENT
I would like to thank Dr. Kitti Kosavisutte, who gave me the chance to conduct this exploratory research on a real case study in the banking industry, for all the advice, inspiration, and recommendations through a whole year of research. Special thanks go to my adviser, Dr. Chamnong Jungthirapanich, who has been more than a trainer, teacher, and role model for my dissertation.
REFERENCES
[1] BCI, Business Continuity Institute, Good Practice Guidelines 2007, A Management Guide to Implementing Global Good Practice in Business Continuity Management. Version 2007.3, October 2007.
[2] BS25999, BS 25999-2 Business Continuity Management - Part 2: Specification for Business Continuity Management, July 2007.
[3] Cegiela, R., Selecting Technology for Disaster Recovery. Proceedings of the International Conference on Dependability of Computer Systems, DEPCOS-RELCOMEX'06, 2006.
[4] Elrod, R., So You Think You Have a Good Business Recovery Plan? Steps an Asset Management Company Can Take to Recover from a Major Disaster. www.infosecwriters.com/text_resources/pdf/Good_Business_Recovery_Plan.pdf, 2005.
[5] FFIEC, Federal Financial Institutions Examination Council, Business Continuity Planning Booklet - IT Examination Handbook. March 2003.
[6] Harper, M. A., Chad M. Lawler, Mitchell A. Thornton. IT Application Downtime, Executive Visibility and Disaster Tolerant Computing. http://engr.smu.edu/~mitch/ftp_dir/pubs/citsa05b.pdf, 2005.
[7] Helms, R. W., S. van Oorschot, J. Herweijer, and M. Plas. An Integral IT Continuity Framework for Undisrupted Business Operations. Proceedings of the First International Conference on Availability, Reliability and Security, ARES'06, 2006.
[8] Jrad, A. M., Chun K. Chan, Thomas B. Morawski. Incorporating the Downtime Due to Disaster Events in the Network Reliability Model. Telecommunication Network Strategy and Planning Symposium, NETWORKS 2004, 11th International, 13-16 June 2004.
[9] Mckinty, S., Combining Clusters for Business Continuity. Cluster Computing, 2006 IEEE International Conference on, 25-28 Sept. 2006, pp. 1-6.
[10] NA, National Academies, Improving Disaster Management: The Role of IT in Mitigation, Preparedness, Response, and Recovery. 2007. http://www.nap.edu/catalog/1184.html
[11] Rudolph, C. G., Business Continuation Planning/Disaster Recovery: A Marketing Perspective. IEEE Communications Magazine, June 1990, pp. 25-28.
[12] Schulman, R. R., Disaster Recovery Issues and Solutions. A White Paper, Enterprise Storage, Hitachi Data Systems, December 2003.
[13] Weaver, R., Remote Recovery - Advanced Technology Solutions for z/OS Recovery. Technical White Paper, BMC Software, 2007.
[14] Zambon, E., D. Bolzoni, S. Etalle, and M. Salvato. A Model Supporting Business Continuity Auditing & Planning in Information Systems. Second International Conference on Internet Monitoring and Protection, ICIMP 2007.
