Chapter objectives After completing this chapter, you will be able to:
Understand what availability means to a commercial enterprise
Describe the inhibitors to availability
Describe operating system facilities that improve availability
Describe the major components of Parallel Sysplex
Benefits
Reliable integration with the internet
Supports ~40 web-based applications
Efficient use of Parallel Sysplex
Improved customer availability
Copyright IBM Corp., 2006. All rights reserved.
Introduction to availability
High Availability
Fault-tolerant, failure-resistant infrastructure supporting continuous application processing
Continuous Operations
Non-disruptive backups and system maintenance coupled with continuous availability of applications
Disaster Recovery
Protection against unplanned outages such as disasters through reliable, predictable recovery
Operations continue after a disaster
What is availability?
Outage Definition
An outage (unavailability) is the time during which a system is not available to an end user. Outages may be planned or unplanned (unexpected).
Planned outages include causes such as database reorganization, release changes, and network reconfiguration.
Unplanned outages are caused by a hardware, software, or data problem.
While planned outages can be scheduled, they are still disruptive. The modern trend is to avoid planned outages altogether, which requires extensive hardware and software facilities.
Types of Outages
Inhibitors to availability
Number of 9s, or the Myth of the Nines

Class of 9s    Outage           Example
99.999%        5 min / year     z/OS Parallel Sysplex
99.99%         53 min / year    S/390 Parallel Sysplex
99.9%          8.8 hrs / year   Single IBM System z CPC
99%            88 hrs / year    Highly available UNIX cluster
90%            876 hrs / year   Campus LAN

Availability classes, from highest to lowest: Continuous Availability, Fault Tolerant, High Availability, General Purpose
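The outage figures for each class of 9s follow from simple arithmetic on the hours in a year. The sketch below is illustrative only; the function name is hypothetical, and it assumes a 365-day year with the outage spread over the whole year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly outage, in hours, implied by an availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


for pct in (99.999, 99.99, 99.9, 99.0, 90.0):
    hours = downtime_hours_per_year(pct)
    if hours < 1:
        # Below one hour, minutes are easier to read.
        print(f"{pct}% -> {hours * 60:.1f} min/year")
    else:
        print(f"{pct}% -> {hours:.1f} hrs/year")
```

Each additional 9 divides the permitted outage by ten, which is why the step from 99.99% to 99.999% leaves only about five minutes per year for all planned and unplanned outages combined.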
Memory
Chip sparing
Error Correction and Checking
Concurrent Microcode (Firmware) updates
Hot Pluggable I/O
PU Conversion
Permanent and Temporary Capacity Upgrades
  Capacity Upgrade on Demand (CUoD)
  Customer Initiated Upgrade (CIU)
  On/Off Capacity on Demand (On/Off CoD)
What Is It?
Provides the ability to nondisruptively increment capacity temporarily when capacity is lost elsewhere in the enterprise
Dual Microcode Loads
  Provide two machine configurations in one box
[Figure labels: 3, Old Book, Resources]
Example:
Absolute storage increment 123 is concurrently moved from physical memory increment 1 to physical memory increment 2.
[Figure: absolute storage increment 123 in the Absolute Storage Space, shown against physical memory increments 1 and 2 in Physical Memory]
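The example above can be pictured as an indirection table between absolute storage and physical memory. The following is a toy model only: the class and method names are hypothetical, and it says nothing about how the hardware actually implements the move. It shows only that the absolute increment number seen by software stays the same while the physical increment behind it changes.

```python
class StorageMap:
    """Toy indirection table: absolute storage increment -> physical memory increment."""

    def __init__(self):
        self._abs_to_phys = {}

    def assign(self, absolute, physical):
        """Back an absolute increment with a free physical increment."""
        if physical in self._abs_to_phys.values():
            raise ValueError(f"physical increment {physical} is already in use")
        self._abs_to_phys[absolute] = physical

    def move(self, absolute, new_physical):
        """Remap an absolute increment; return the physical increment it vacated."""
        if new_physical in self._abs_to_phys.values():
            raise ValueError(f"physical increment {new_physical} is already in use")
        old_physical = self._abs_to_phys[absolute]
        self._abs_to_phys[absolute] = new_physical
        return old_physical

    def physical_for(self, absolute):
        """Look up which physical increment currently backs an absolute increment."""
        return self._abs_to_phys[absolute]


storage = StorageMap()
storage.assign(123, 1)          # absolute increment 123 starts on physical increment 1
freed = storage.move(123, 2)    # remap it to physical increment 2
print(storage.physical_for(123), freed)
```

After the move, physical increment 1 no longer backs any absolute storage, so in this model it could be taken offline without the software noticing.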
Key Usage
Memory Upgrade
Dynamic MBA fanout error recovery
Reduction of UIRA outage
Book Repair
STI cable repair
MBA fanout card repair
On book add, MBA fanouts used for I/O are concurrently rebalanced to the new book
[Figure labels: STI from Book 0, STI from Book 1, 8 MBA Fanout and 16 STIs (shown twice), ICB-4 2 GB/sec, STI-MP & STI-A8 Cards, I/O Cage, I/O Ports, I/O features (FICON Express2, OSA-Express2)]
[Figure labels: Logical PU6; Physical PUx, PUy]
z990
Scheduled Outage Scheduled Outage Unscheduled Outage
z9 EC
Concurrent* Concurrent Concurrent (Book Offline) Transparent
6 Hr Scheduled outage 6 Hr Scheduled outage Not Applicable Scheduled Outage Unscheduled Outage
Scheduled Outage. Lose connectivity to I/O Domain; Scheduled Outage. Lose connectivity to I/O Domain; Concurrent. Connectivity to I/O Domain remains
As for MBA; As for MBA; As for MBA
Unscheduled Outage; Concurrent; Scheduled Outage; Concurrent; 1 System
Unscheduled Outage; Concurrent; Scheduled Outage; Concurrent; 2 / Book
[Figure labels: CSS / CHPID, Director (Switch), DASD CU (shown twice)]
Unlimited distance support
Performance impact negligible
System Data Mover (SDM) provides
  Data consistency of secondary data
  Central point of control
[Figure labels: PPRC, XRC, System z with z/OS, SDM, 1, 2]
[Figure: application I/Os against volumes A and B under synchronous PPRC, shown in full duplex and suspended states]
Parallel Sysplex
Removes Single Point of Failure
  Server
  LPAR
  Subsystems
Planned and Unplanned Outages
Single System Image
Dynamic Session Balancing
Dynamic Transaction Routing
[Figure: three IBM System z servers]
Highlights
Data sharing
Locking
Cross-system workload dispatching
Synchronization of time for logging, etc.
Coupling Facility
Sysplex Timer
TOD clock synchronization
Workload Manager in z/OS
Compatibility and exploitation in software subsystems, such as data sharing in database systems
Hardware/software combination
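A rough way to see why removing a single point of failure matters is to compare the availability of one system with that of several redundant systems. The sketch below is a textbook calculation, not a statement about measured Parallel Sysplex behavior. It assumes that systems fail independently, that any one surviving system can carry the workload, and that shared components (coupling facility, timer, disk) never fail. The function name is hypothetical.

```python
def combined_availability(single: float, systems: int) -> float:
    """Availability of `systems` redundant members, each with availability `single`.

    The service is down only when every member is down at the same time,
    which under the independence assumption has probability (1 - single) ** systems.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if systems < 1:
        raise ValueError("at least one system is required")
    return 1.0 - (1.0 - single) ** systems


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two members at 99% each give 99.99% combined. Real configurations fall short of this figure to the extent that failures are correlated or shared components fail.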
z/OS factors to availability
Workload balancing using Workload Manager (WLM)
Highly automated system
Capability to restart applications using the Automatic Restart Manager (ARM) without interfering with other applications or z/OS itself
Assists two-phase commits using Resource Recovery Services (RRS)
Make dynamic changes to your system configuration using System Modification Program Extended (SMP/E)
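The two-phase commit protocol mentioned above can be sketched in a few lines. This is a generic illustration of the protocol, not the RRS programming interface; every class, method, and participant name here is hypothetical, and a real coordinator would also log its decisions so that it can recover after a failure.

```python
class Participant:
    """A resource manager taking part in one transaction (toy model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote yes (True) or no (False) on committing."""
        self.state = "prepared" if self.can_commit else "vote_no"
        return self.can_commit

    def commit(self):
        """Phase 2, all voted yes: make the changes permanent."""
        self.state = "committed"

    def rollback(self):
        """Phase 2, someone voted no: undo the changes."""
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Commit on every participant or on none; return True if committed."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


database = Participant("database")
queue = Participant("message queue", can_commit=False)
print(two_phase_commit([database, queue]), database.state, queue.state)
```

The point of the protocol is that no participant commits until all have promised they can, so a failure in one resource manager never leaves the others with half of a transaction.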
The Human Factor
Automation: critical for successful rapid recovery and continuity
The more people involved, the higher the odds of human error.
The benefits of automation:
Allows business continuity processes to be built on a reliable, consistent recovery time
Recovery times can remain consistent as the system scales, providing a flexible solution designed to meet changing business needs
Reduces infrastructure management cost and staffing skill requirements
Reduces or eliminates human error during the recovery process at time of disaster
Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with notification of events that occur within the environment
Value
Tier 7 - Near zero or zero Data Loss: Highly automated takeover on a complex-wide or business-wide basis, using remote disk mirroring
  GDPS/PPRC HyperSwap Manager: RTO depends on customer automation; RPO 0
Tier 6 - Near zero or zero Data Loss: remote disk mirroring helping with data integrity and data consistency
Tier 5 - Software two site, two phase commit (transaction integrity); or repetitive PiT copies with small data loss
Tier 4 - Batch/Online database shadowing & journaling, repetitive PiT copies, fuzzy copy disk mirroring
Tier 3 - Electronic Vaulting
Tier 2 - PTAM, Hot Site
Tier 1 - PTAM*
Point-in-Time Backup
Time axis: 15 Min., 1-4, 4-6, 6-8, 8-12, 12-16, 24, 72
Tiers based on Share Group 1992
*PTAM = Pickup Truck Access Method
[Figure: two-site configuration, SITE 1 and SITE 2]
Multi-site base or Parallel Sysplex environment
Remote data mirroring using PPRC
Manages unplanned reconfigurations
  z/OS, CF, disk, tape, site
Designed to maintain data consistency and integrity across all volumes
Supports fast, automated site failover
No or limited data loss (customer business policies)
Single point of control for
  Standard actions: Stop, Remove, IPL system(s)
  Parallel Sysplex Configuration management
  User defined script (e.g. Planned Site Switch)
  PPRC Configuration management
[Figure labels: SITE 2, CF1, PROD, P1, P2, P3, P4, K1, K2, K/L]
Continuous Availability and Disaster Recovery at unlimited distance (GDPS/PPRC & GDPS/XRC)
[Figure: Site 2 lies at metropolitan distance and Site 3 at unlimited distance. A Parallel Sysplex with two CFs is connected by FICON or ESCON. GDPS/PPRC manages PX (PPRC primary) and P' (PPRC secondary); GDPS/XRC manages PX (XRC primary) and X' (XRC secondary).]
Disaster/Recovery
Production site 1 failure
No data loss between sites 1 and 2
Sites 1 and 2 can be in the same building or at campus distance to minimize performance impact
Site 3 can recover with no data loss in most instances
Production can continue with site 1 data (P')
Site 3 can recover with minimal loss of data
Site 2 failure
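The difference between "no data loss" and "minimal loss of data" comes down to whether the secondary copy is updated before or after a write is acknowledged. The toy model below illustrates that idea only; it is not how PPRC or XRC are implemented, and the class and method names are hypothetical.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a primary volume with a remote secondary copy."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.secondary = []
        self._in_flight = deque()  # writes not yet applied at the secondary

    def write(self, record):
        """Apply a write at the primary and mirror it according to the mode."""
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the secondary is updated before the write completes.
            self.secondary.append(record)
        else:
            # Asynchronous: the write completes now and is shipped later.
            self._in_flight.append(record)

    def drain(self, count):
        """Ship up to `count` pending writes to the secondary, in order."""
        for _ in range(min(count, len(self._in_flight))):
            self.secondary.append(self._in_flight.popleft())

    def lost_if_primary_fails(self):
        """Number of completed writes missing from the secondary right now."""
        return len(self.primary) - len(self.secondary)


sync_volume = MirroredVolume(synchronous=True)
async_volume = MirroredVolume(synchronous=False)
for i in range(5):
    sync_volume.write(i)
    async_volume.write(i)
async_volume.drain(3)  # only part of the backlog reached the remote site
print(sync_volume.lost_if_primary_fails(), async_volume.lost_if_primary_fails())
```

In this model the synchronous volume never trails the primary, at the cost of waiting for the remote site on every write, which is why synchronous mirroring is tied to short distances. The asynchronous volume can sit at any distance but may be missing the most recent writes when the primary site is lost.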
SUMMARY
Built In Redundancy
Capacity Upgrade on Demand
Capacity Backup
Hot Pluggable I/O

Addresses Planned/Unplanned Hardware and Software Outages
Flexible, Nondisruptive Growth
Capacity beyond largest CEC
Scales better than SMPs
Dynamic Workload/Resource Management

Addresses Site Failure/Maintenance
Sync/Async Data Mirroring
Eliminates Tape/Disk SPOF
No/Some Data Loss
Application Independent
Key terms in this chapter
ARM, Automate, Availability, CA, Data sharing, Disaster, Disk mirroring, GDPS, HA, LPAR, MTBF, N+1, Recover, SMP/E, SPOF, Sysplex, Sysplex Timer, System log, Trace