Introduction To The New Mainframe: Large-Scale Commercial Computing

Introduction to the new mainframe: Large-Scale Commercial Computing Chapter 5: Availability
Copyright IBM Corp., 2006. All rights reserved.
Chapter objectives
After completing this chapter, you will be able to:
  Understand what availability means to a commercial enterprise
  Describe the inhibitors to availability
  Describe operating system facilities that improve availability
  Describe the major components of Parallel Sysplex
A real customer requirement: Royal Bank Boosts Availability - Online Banking

IBM System z Parallel Sysplex System
Front End - Internet
WebSphere MQ For z/OS, V5.3
Challenge: Maximize Availability

Back End - Data/Applications
12 million customers
2.5 million online
60,000 employees
DB2 Database
IMS Database
CICS Applications
Benefits
  Reliable integration with internet
  Supports ~40 web-based applications
  Efficient use of Parallel Sysplex
  Improved customer availability
Introduction to availability
High Availability
Fault-tolerant, failure-resistant infrastructure supporting continuous application processing
Continuous Operations
Non-disruptive backups and system maintenance coupled with continuous availability of applications
Disaster Recovery
Protection against unplanned outages such as disasters through reliable, predictable recovery
  Protection of critical business data
  Recovery is predictable and reliable
  Operations continue after a disaster
  Costs are predictable and manageable
What is availability?
Availability is the state of an application being accessible to the end user.
Outage Definition
An outage (unavailability) is the time during which a system is not available to an end user. Outages may be planned or unplanned (unexpected). Planned outages have causes such as database reorganization, release changes, and network reconfiguration. Unplanned outages are caused by a hardware, software, or data problem. Although planned outages can be scheduled, they are still disruptive. The modern trend is to try to avoid planned outages altogether, which requires extensive hardware and software facilities.
Cost of outages (1)
Financial Impact of Downtime Per Hour (by various Industries)

Source: Contingency Planning Research & Strategic Research Corp.
Cost of outages (2)
Types of Outages
Common Causes for Application Downtime

Source: Standish Group Research
Inhibitors to availability
Number of 9s, or the myth of the nines
Availability | Outage | Example
99.999 % | 5 min / year | z/OS Parallel Sysplex
99.99 % | 53 min / year | S/390 Parallel Sysplex
99.9 % | 8.8 hrs / year | Single IBM System z CPC
99 % | 88 hrs / year | Highly available UNIX cluster
90 % | 876 hrs / year | Campus LAN
Classes of 9s: Continuous Availability, Fault Tolerant, High Availability, General Purpose
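The outage figures for each class of nines follow from simple arithmetic on the hours in a year. The sketch below shows that calculation; the function name and the choice of a 365-day year are assumptions made for illustration, not part of the original slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly outage time, in hours, implied by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.9, 99.0, 90.0):
        hours = downtime_hours_per_year(pct)
        if hours < 1.0:
            print(f"{pct} % -> {hours * 60:.1f} min / year")
        else:
            print(f"{pct} % -> {hours:.1f} hrs / year")
```

Each additional nine cuts the permitted outage time by a factor of ten, which is why the step from 99.9 % to 99.999 % requires redundancy at every level rather than incremental tuning.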
IBM System z9 EC Under the covers (Model S38 or S54)

Front View
  Internal Batteries (optional)
  Hybrid Cooling
  Power Supplies
  Processor Books and Memory
  CEC Cage
  3x I/O cages
  Support Elements
Redundancy IBM Mainframe Hardware

Power
  2x Power Supply
  2x Power feed
Internal Battery Feature
  Optional internal battery in case of loss of external power
Cooling
Dynamic oscillator switchover
Processors
  Multiprocessors
  Spare PUs
Memory
  Chip sparing
  Error Correction and Checking
Enhanced book availability
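The effect of duplicating a component, as in the dual power supplies and power feeds listed above, can be estimated with the standard formula for components in parallel. The sketch below assumes that failures are independent and that any one working copy is sufficient; both assumptions, and the function name, are illustrative and do not come from the slide.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when one working copy suffices.

    Assumes independent failures: the group is down only if every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one power supply
    for n in (1, 2, 3):
        print(f"{n} copies -> {parallel_availability(single, n):.6f}")
```

Under these assumptions, a second copy of a 99 % component raises the combined availability to about 99.99 %. Real failures are often correlated (a shared power feed, for example), which is one reason the hardware duplicates the feeds as well as the supplies.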

Concurrent Maintenance and Upgrades

Duplex Units
  Power Supplies
Concurrent Microcode (Firmware) updates
Hot Pluggable I/O
PU Conversion
Permanent and Temporary Capacity Upgrades
  Capacity Upgrade on Demand (CUoD)
  Customer Initiated Upgrade (CIU)
  On/Off Capacity on Demand (On/Off CoD)
  Capacity BackUp (CBU)
Capacity BackUp (CBU)

Who Needs It?
  Any business with a requirement for increased availability or Disaster Recovery
What Is It?
  Provides the ability to nondisruptively increment capacity temporarily when capacity is lost elsewhere in the enterprise
  Dual Microcode Loads
  Provide two machine configurations in one box
  Take advantage of "spare" PUs
  Significant cost savings possible
    Standby MIPS cost can be eliminated
    IBM Software license charges on standby MIPS can be eliminated
  Configure memory and channels to support production workload
How Can I Use It?
  Adjacent machines in the same location
  Multiple images in the same Parallel Sysplex cluster
  Backup/Recovery site
Figure: Production Server and CBU Server
z9 EC Enhanced Book Availability

Book Add
  Model upgrade by the addition of a single new book, adding physical processors, memory, and I/O connections
Continued Capacity with Fenced Book
  Make use of the LICCC-defined resources of the fenced book to allocate physical resources on the operational books as possible
Book Repair
  Replacement of a defective book when that book had been previously fenced from the system during the last IML
Book Replacement
  Removal and replacement of a book for either repair or upgrade
IBM System z9 EC Enhanced Book Replacement (EBR) Flow 1

Book Add
  Processor Upgrade
  Add Memory
  Additional I/O Bandwidth
Book Replace/Repair
  Models S18, S28, S38, S54 only
  Requires sufficient resources in remaining Book(s)
  Prepare for Book removal via SE
  Resources reassigned to active Book(s) before repair/replace
  'Fence' off Book for removal
Failed Book
  Models S18, S28, S38, S54 only
  Requires sufficient resources in remaining Book(s)
  Re-IML system with failed Book 'fenced' off
  During IML, reassign resources to surviving Book(s)
  Remove 'fenced' Book for replacement/repair
Remove Book to be replaced/repaired (Old Book)
Replace with new/repaired Book (New Book)
After Book Add/Replace/Repair
  Restore/Reconfigure resources: Processors, Memory, I/O
EBR - Dynamic Memory Move

The Dynamic Memory Move operation concurrently changes the physical memory backing of an absolute storage increment
Performed transparently to the operating system
Utilizes the zSeries Copy/Reassign hardware
Used during EBA to:
  Move physical memory usage from the targeted book to books that will remain in the system
  Optimize memory allocation after EBA completion
Example:
  Absolute storage increment 123 is concurrently moved from physical memory increment 1 to physical memory increment 2.
Figure: Absolute Storage Space increment 123 backed by Physical Memory increment 1, then increment 2
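The idea behind the example above, an address that stays the same while its physical backing changes, can be modelled with a small mapping table. The sketch below is a toy model only: the class and method names are hypothetical, and it does not describe how the Copy/Reassign hardware actually works.

```python
class MemoryMap:
    """Toy model: absolute storage increments backed by physical memory increments."""

    def __init__(self) -> None:
        self.backing: dict[int, int] = {}     # absolute increment -> physical increment
        self.contents: dict[int, bytes] = {}  # physical increment -> stored data

    def assign(self, absolute: int, physical: int, data: bytes = b"") -> None:
        """Back an absolute increment with a free physical increment."""
        if physical in self.contents:
            raise ValueError(f"physical increment {physical} is already in use")
        self.backing[absolute] = physical
        self.contents[physical] = data

    def move(self, absolute: int, new_physical: int) -> None:
        """Copy the data to a new physical increment, then reassign the mapping."""
        if new_physical in self.contents:
            raise ValueError(f"physical increment {new_physical} is already in use")
        old_physical = self.backing[absolute]
        self.contents[new_physical] = self.contents.pop(old_physical)  # copy
        self.backing[absolute] = new_physical                          # reassign

    def read(self, absolute: int) -> bytes:
        """Read through the absolute address; the caller never sees the backing."""
        return self.contents[self.backing[absolute]]


if __name__ == "__main__":
    memory = MemoryMap()
    memory.assign(absolute=123, physical=1, data=b"payload")
    memory.move(absolute=123, new_physical=2)
    print(memory.backing[123], memory.read(123))
```

The reader of absolute increment 123 gets the same data before and after the move, which is the sense in which the operation is transparent to the operating system.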
EBR - Redundant I/O Interconnect (RII)

STI Multipath Module (STI-MP)
  A multiplexer that supports attachment to four I/O features in an I/O domain and has an alternate path to a second STI-MP for a redundant I/O infrastructure.
Key Usage
  Memory Upgrade
  Dynamic MBA fanout error recovery
  Reduction of UIRA outage
  Book Repair
  STI cable repair
  MBA fanout card repair
  On book add, MBA fanouts used for I/O are concurrently rebalanced to the new book
Figure: Processor Book 0 and Processor Book 1 (each with PUs, L2 Cache, Memory Cards, 8 MBA Fanout, 16 STIs) joined by a Ring Structure; STI from Book 0 and STI from Book 1 (STI 2.7 GB/sec, ICB-4 2 GB/sec) attach to the STI-MP and STI-A8 cards (STI mother card, STI daughter card) in the I/O Cage, which holds I/O features and I/O ports such as FICON Express2 and OSA-Express2
EBR - Concurrent Physical Processor Reassignment

This operation concurrently changes the physical backing of one or more logical processors
The state of the operating source physical processor is captured and transplanted into the target physical processor
Expected to be transparent to the operating system
Utilizes the PU sparing function
Used during EBA to:
  Move processors from the targeted book to spare processors on a book remaining in the system
  Rebalance processors after EBA completion
Figure: Logical PU6 moved from Physical PUx to Physical PUy
Evolution of RAS for IBM System z high end Systems

Item | z900 | z990 | z9 EC
Microcode Driver Updates | 6 Hr Scheduled Outage | 6 Hr Scheduled Outage | Concurrent*
Book Replacement** | Not Applicable | Scheduled Outage | Concurrent
Memory Replacement | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
ECC on Memory Control Circuitry (e.g. SMI) | Unscheduled Outage | Unscheduled Outage | Transparent
Memory Bus Adapter (MBA) Replacement | Scheduled Outage. Lose connectivity to I/O Domain | Scheduled Outage. Lose connectivity to I/O Domain | Concurrent. Connectivity to I/O Domain remains
STI Failure | As for MBA | As for MBA | As for MBA
Oscillator Failure | Unscheduled Outage | Unscheduled Outage | Transparent
Processor Upgrades | Concurrent | Concurrent | Concurrent
Physical Memory Upgrades | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
I/O Upgrades | Concurrent | Concurrent | Concurrent
Spare PUs | 1 / System | 2 / Book | 2 / System
*In select circumstances
**Customer pre-planning required; may require acquisition of additional hardware resources
Create a redundant I/O configuration

Figure: LPAR1, LPAR2, ... LPARn connected through CSS / CHPID and a Director (Switch) to the DASD CUs
RAS Features of a Storage Subsystem

Independent dual power feeds
N+1 power supply technology/hot swappable power supplies, fans
N+1 cooling
Battery backup
Non-volatile subsystem cache, to protect writes that have not yet been hardened to DASD
Nondisruptive maintenance
Concurrent LIC activation
Concurrent repair and replace actions
RAID architecture
Redundant microprocessors and data paths
Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
Redundant shared memory
Spare disk drives
Remote Copy to a second storage subsystem
  Synchronous (Peer to Peer Remote Copy, PPRC)
  Asynchronous (Extended Remote Copy, XRC)
Disk Mirroring using PPRC and XRC

PPRC (Metro Mirror)
  Synchronous remote data mirroring
    Application receives I/O complete when both primary and secondary disks are updated
  Typically supports metropolitan distance
  Performance impact must be considered
    Latency of 10 km
XRC (z/OS Global Mirror)
  Asynchronous remote data mirroring
    Application receives I/O complete as soon as primary disk is updated
  Unlimited distance support
  Performance impact negligible
  System Data Mover (SDM) provides
    Data consistency of secondary data
    Central point of control
Figure: System z running z/OS, with PPRC mirroring between the disks and XRC mirroring through the SDM
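The difference between the two mirroring modes comes down to when the application is told that its write is complete. The sketch below models that difference with in-memory dictionaries; the class, its methods, and the `drain` step standing in for a data mover are all hypothetical and are not PPRC or XRC interfaces.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a primary volume mirrored to a secondary volume."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: dict[int, bytes] = {}
        self.secondary: dict[int, bytes] = {}
        # Updates acknowledged to the application but not yet on the secondary
        # (used only in asynchronous mode).
        self.pending: deque[tuple[int, bytes]] = deque()

    def write(self, track: int, data: bytes) -> str:
        """Update the primary, then mirror according to the configured mode."""
        self.primary[track] = data
        if self.synchronous:
            # Synchronous: the secondary is updated before completion is signalled.
            self.secondary[track] = data
        else:
            # Asynchronous: completion is signalled after the primary update only.
            self.pending.append((track, data))
        return "I/O complete"

    def drain(self) -> int:
        """Apply queued updates to the secondary in their original order."""
        applied = 0
        while self.pending:
            track, data = self.pending.popleft()
            self.secondary[track] = data
            applied += 1
        return applied


if __name__ == "__main__":
    sync_volume = MirroredVolume(synchronous=True)
    sync_volume.write(7, b"order-1")
    print("sync secondary current:", sync_volume.secondary == sync_volume.primary)

    async_volume = MirroredVolume(synchronous=False)
    async_volume.write(7, b"order-1")
    print("async updates pending:", len(async_volume.pending))
    async_volume.drain()
    print("async secondary current:", async_volume.secondary == async_volume.primary)
```

In the synchronous case every write waits for the remote update, so distance adds latency to each I/O. In the asynchronous case the application does not wait, but any updates still queued when the primary site is lost are missing from the secondary.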
PPRC Failover / Failback (FO/FB)

The new primary volumes (at the remote site) record changes while in failover mode.
The original mode of the volumes at the local site is preserved as it was when the failover was initiated.
Only the changes made since the time of failover need to be resynchronized, not the entire data set.
Figure: four stages (Normal, Failover, Failback Start, Failback Finish), each showing where the application I/Os are directed and the state of the Sync PPRC relationship between volumes A and B (full duplex or suspended)
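Resynchronizing only what changed since failover depends on the new primary keeping track of which tracks were written. The sketch below illustrates that bookkeeping with a set of changed track numbers; the class and method names are hypothetical and the model is far simpler than the PPRC implementation.

```python
class FailoverPair:
    """Toy model of a mirrored pair: site A (original primary) and site B."""

    def __init__(self, tracks: dict[int, str]) -> None:
        self.site_a = dict(tracks)
        self.site_b = dict(tracks)    # starts in full duplex with site A
        self.changed: set[int] = set()  # tracks written at site B during failover
        self.failed_over = False

    def failover(self) -> None:
        """Make site B the primary and start recording changes."""
        self.failed_over = True
        self.changed.clear()

    def write(self, track: int, data: str) -> None:
        if self.failed_over:
            # Site A is unavailable: update site B and remember the track.
            self.site_b[track] = data
            self.changed.add(track)
        else:
            # Normal operation: both sites are updated together.
            self.site_a[track] = data
            self.site_b[track] = data

    def failback(self) -> int:
        """Copy only the recorded tracks back to site A; return how many were copied."""
        copied = 0
        for track in sorted(self.changed):
            self.site_a[track] = self.site_b[track]
            copied += 1
        self.changed.clear()
        self.failed_over = False
        return copied


if __name__ == "__main__":
    pair = FailoverPair({n: f"v0-{n}" for n in range(1000)})
    pair.failover()
    pair.write(10, "v1-10")
    pair.write(11, "v1-11")
    print("tracks copied on failback:", pair.failback())
    print("sites identical:", pair.site_a == pair.site_b)
```

In this run, two tracks out of 1000 are copied during failback, so the time to return to full duplex depends on the amount of change during the outage rather than on the size of the volumes.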
Parallel Sysplex
Removes Single Point of Failure
  Server
  LPAR
  Subsystems
Planned and Unplanned Outages
Single System Image
Dynamic Session Balancing
Dynamic Transaction Routing
Highlights
  Data sharing
  Locking
  Cross-system workload dispatching
  Synchronization of time for logging, etc.
Hardware/software combination
  Coupling Facility
  Sysplex Timer: TOD clock synchronization
  Workload Manager in z/OS
  Compatibility and exploitation in software subsystems, such as data sharing in database systems
Figure: three IBM System z servers
z/OS factors contributing to availability
  Workload balancing using Workload Manager (WLM)
  Highly automated system
  Capability to restart applications using the Automatic Restart Manager (ARM) without interfering with other applications or z/OS itself
  Assists two-phase commits using Resource Recovery Services (RRS)
  Make dynamic changes to your system configuration using the System Modification Program Extended (SMP/E)
Error recording and error recovery routines
z/OS Recovery
z/OS Recovery features
  Recovery Termination Manager (RTM)
  Extended Specify Task Abnormal Exit (ESTAE)
  Functional Recovery Routine (FRR)
The Human Factor
Automation: critical for successful rapid recovery and continuity
The more people involved, the higher the odds of human error.
The benefits of automation:
  Allows business continuity processes to be built on a reliable, consistent recovery time
  Recovery times can remain consistent as the system scales, providing a flexible solution designed to meet changing business needs
  Reduces infrastructure management cost and staffing skill requirements
  Reduces or eliminates human error during the recovery process at time of disaster
  Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
  Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and the network, along with notification of events that occur within the environment
Tiers of Disaster Recovery

Tier 7 - Near zero or zero data loss: highly automated takeover on a complex-wide or business-wide basis, using remote disk mirroring
Tier 6 - Near zero or zero data loss: remote disk mirroring helping with data integrity and data consistency
Tier 5 - Software two-site, two-phase commit (transaction integrity); or repetitive PiT copies with small data loss
Tier 4 - Batch/online database shadowing and journaling, repetitive PiT copies, fuzzy copy disk mirroring
Tier 3 - Electronic Vaulting
Tier 2 - PTAM, Hot Site
Tier 1 - PTAM*
GDPS offerings shown on the chart:
  GDPS/PPRC HyperSwap Manager: RTO depends on customer automation; RPO 0
  GDPS/PPRC: RTO < 1 hr; RPO 0
  GDPS/XRC, GDPS/Global Mirror: RTO < 2 hr; RPO < 1 min
Application classes: Mission Critical Applications, Somewhat Critical Applications, Not so Critical Applications
Recovery approaches: Dedicated Remote Hot Site, Active Secondary Site, Point-in-Time Backup
Figure axes: Value against Time to Recover (hrs): 15 min, 1-4, 4-6, 6-8, 8-12, 12-16, 24, 72
Tiers based on SHARE Group 1992
*PTAM = Pickup Truck Access Method
Today's Business Continuity Objectives Demand Rapid Database Availability
Achieve Application and Database Restart
  Database Restart: starting a database application following an outage without having to restore the database
  Consistent, repeatable, fast
  This is a process measured in minutes
Avoid Application and Database Recovery
  Database Recovery: restoring the last set of Image Copy tapes and applying log changes to bring the database up to the point of failure
  Unpredictable recovery time, usually very long and very labor intensive
  This is a process measured in hours or even days

What is GDPS/PPRC? (Metro Mirror)

Figure: SITE 1 and SITE 2, each attached to the NETWORK
Multi-site base or Parallel Sysplex environment
Remote data mirroring using PPRC
Manages unplanned reconfigurations
  z/OS, CF, disk, tape, site
  Designed to maintain data consistency and integrity across all volumes
  Supports fast, automated site failover
  No or limited data loss (depending on customer business policies)
Single point of control for
  Standard actions
    Stop, Remove, IPL system(s)
  Parallel Sysplex configuration management
  User-defined script (e.g. Planned Site Switch)
  PPRC configuration management
Multiple Site Workload - Cross-site Sysplex Continuous Availability Configuration

Figure: SITE 1 and SITE 2, with coupling facilities CF1 and CF2, production (PROD) and CBU capacity, systems P1, P2, P3, P4, K1, and K2, and disks labelled K/L
Continuous Availability and Disaster Recovery at unlimited distance (GDPS/PPRC & GDPS/XRC)
IBM System z Solution

Figure: Production Site 1 and Site 2 at metropolitan distance, Site 3 at unlimited distance; CFs and Parallel Sysplex; FICON or ESCON links; GDPS/PPRC and GDPS/XRC; volumes PX (PPRC primary, XRC primary), P' (PPRC secondary), and X' (XRC secondary)
Continuous Availability: GDPS/PPRC or GDPS/PPRC HM
  Designed to provide continuous availability and no data loss between sites 1 and 2
  Sites 1 and 2 can be in the same building or at campus distance to minimize performance impact
Disaster/Recovery
  Production site 1 failure: Site 3 can recover with no data loss in most instances
  Site 2 failure: Production can continue with site 1 data (P')
  Site 1 and 2 failure: Site 3 can recover with minimal loss of data
SUMMARY
Built-In Redundancy
Capacity Upgrade on Demand
Capacity Backup
Hot Pluggable I/O

Addresses Planned/Unplanned Hardware and Software Outages
Flexible, Nondisruptive Growth
  Capacity beyond largest CEC
  Scales better than SMPs
Dynamic Workload/Resource Management

Addresses Site Failure/Maintenance
Sync/Async Data Mirroring
  Eliminates Tape/Disk SPOF
  No/Some Data Loss
Application Independent
Key terms in this chapter
ARM, Automate, Availability, CA, Data sharing, Disaster, Disk mirroring, GDPS, HA, LPAR, MTBF, N+1, Recover, SMP/E, SPOF, Sysplex, Sysplex Timer, System log, Trace