
1 © 2013 IBM Corporation

High Availability Options


• Server failover
– Shared disk or remote disk mirroring
• HADR
– HA and/or disaster recovery
– Easy to set up and manage
– Automatic failover with TSA integration
– Fast failover
• pureScale (active/active)
– Continuous availability
– Load balancing
– Easy to set up and manage
• Q Replication
– Flexible – can handle database subsets
– Can be complex to set up but offers extensive flexibility
– Active/active
– Asynchronous



Server-Based Failover
• DB2 ships with an integrated TSA cluster manager
– Node failure detection
– Disk takeover
– IP takeover
– Restart DB2
• Management framework included to keep the cluster topology in sync
(Diagram: clients with in-flight transactions failing over from one active server to the other)


High Availability Disaster Recovery (HADR)
• First introduced in DB2 8.2
• Technology first appeared in IDS in 1994
• Provides local high availability and/or disaster recovery
– Keeps two copies of a database in sync with each other on two different servers
• Simple to set up and manage
• DB2 9.5 adds an integrated cluster manager for automatic failover
• DB2 10 adds multiple standbys, time delay, and log buffering to handle network spikes
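Setting up an HADR pair is mostly database configuration. A minimal sketch, assuming a database named SAMPLE; hostnames, ports, and the instance name are placeholders, and the standby is first initialized from a backup/restore of the primary:

```shell
# On each server: point the database at its partner
db2 update db cfg for SAMPLE using \
    HADR_LOCAL_HOST host1 HADR_LOCAL_SVC 51012 \
    HADR_REMOTE_HOST host2 HADR_REMOTE_SVC 51012 \
    HADR_REMOTE_INST db2inst1 \
    HADR_SYNCMODE NEARSYNC

# Start the standby first, then the primary
db2 start hadr on db SAMPLE as standby    # on the standby server
db2 start hadr on db SAMPLE as primary    # on the primary server
```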



Main Goals of the HADR Design
• Ultra-fast failover

• Easy administration

• Negligible impact on performance

• Software upgrades without interruption

• Transparent failover and fallback for applications

• Cluster Manager software (TSA) included with DB2



DB2 Delivers Fast Failover at Low Cost
• Redundant copy of the database to protect against site or storage failure
• Support for rolling upgrades
• Failover typically under 15 seconds
– Example: real SAP workload, 600 SAP users – database available in 11 sec.
• 100% performance after primary failure
(Diagram: automatic client reroute lets the client application transparently resume on the standby; HADR keeps the two servers in sync over a network connection; the built-in TSA cluster manager monitors the primary and performs takeover, after which the standby server becomes the primary)

Synchronization Modes
Four modes: Synchronous, Near-Synchronous, Asynchronous, and Super-Asynchronous
(Diagram: a commit request on the primary flows through the log writer, HADR send() on the primary, HADR receive() on the standby, and finally the standby's log file; each mode acknowledges the commit at a different point)
• Synchronous – commit succeeds once the log records are written to the standby's log file
• Near-Synchronous – commit succeeds once the standby has received the log records in memory
• Asynchronous – commit succeeds once the log records have been handed to the network (send() succeeds)
• Super-Asynchronous – commit never waits on the standby; log shipping lags as needed
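The mode is controlled by the HADR_SYNCMODE database configuration parameter. A minimal sketch, assuming a database named SAMPLE:

```shell
# Valid values: SYNC, NEARSYNC (the default), ASYNC, SUPERASYNC
db2 update db cfg for SAMPLE using HADR_SYNCMODE NEARSYNC
# The change takes effect when HADR is restarted on the pair
```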



Failover
• Single command: TAKEOVER
– Changes the standby into a primary
– Switches the roles of a healthy primary–standby pair
– No db2start / restart database / rollforward etc.
• Integrated TSA provides heartbeat monitoring and automated takeover
– Set up for you during DB2 installation
• Automatic client reroute (ACR) provides transparent failover
– ACR reruns the statement that was running when the failure occurred, as long as it is the first statement of a transaction and no data has yet been returned
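The two takeover variants can be sketched as follows; the database name SAMPLE is a placeholder, and both commands are issued on the standby:

```shell
# Graceful role switch between a healthy primary/standby pair
db2 takeover hadr on db SAMPLE

# Forced takeover when the primary is unreachable
db2 takeover hadr on db SAMPLE by force
```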



HADR Multiple Standbys (cont.)
(Diagram: the primary ships logs to a principal standby in any sync mode, and to up to two auxiliary standbys in super-async mode)
• Allows for one standby for high availability and up to two other standbys for disaster recovery
– Rolling fix pack updates of standbys and primary without losing HA
• Reads on standby supported on all standbys
• Takeover (forced and non-forced) supported from any standby
– After takeover, configuration parameters on the new primary's standbys are changed automatically so they point to the new primary
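With multiple standbys, the primary names its standbys explicitly. A hedged sketch (hostnames and ports are placeholders, database SAMPLE assumed); the first entry in the list is the principal standby:

```shell
db2 update db cfg for SAMPLE using \
    HADR_TARGET_LIST "host2:51012|host3:51012|host4:51012"

# Reads on standby is enabled separately, via a registry variable
db2set DB2_HADR_ROS=ON
```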



Log Spooling on the Standby
• When enabled, allows the standby to spool log records arriving from the primary
• Decouples log replay on the standby from receipt of the log data from the primary
• Supported with any synchronization mode
(Diagram: primary ships logs, shown in super-async mode; logs are spooled on the standby until replay catches up)
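Spooling is controlled by a database configuration parameter. A sketch, assuming a database named SAMPLE:

```shell
# HADR_SPOOL_LIMIT is in 4 KB pages; -1 spools without limit, 0 disables spooling
db2 update db cfg for SAMPLE using HADR_SPOOL_LIMIT -1
```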



Time-Delayed Apply on the Standby
• Helps recover from application errors
– For example, accidental deletion of important table data
– The error must be noticed before the delay expires and the change is replayed on the standby
• Enabled via the new HADR_REPLAY_DELAY database configuration parameter
– Specifies a delay in seconds for applying changes on a standby
– A value of 0 means no time delay (the default)
(Diagram: primary ships logs in super-async mode; logs are spooled on the standby and replayed only after the configured delay)
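A sketch of enabling the delay on a standby (database name SAMPLE assumed; the delayed standby is expected to run in super-async mode):

```shell
# Replay on this standby runs one hour behind the primary (value in seconds)
db2 update db cfg for SAMPLE using HADR_REPLAY_DELAY 3600
```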



DB2 pureScale

• Extreme capacity
– Buy only what you need, add capacity as your needs grow
• Application transparency
– Avoid the risk and cost of application changes
• Continuous availability
– Deliver uninterrupted access to your data with consistent performance

Learning from the undisputed Gold Standard... System z



DB2 pureScale Architecture
• Automatic workload balancing
• Cluster of DB2 nodes running on Power or Intel servers
• Leverages the global lock and memory manager technology from z/OS
• Integrated cluster manager
• InfiniBand or Ethernet and DB2 Cluster Services
• Shared data
DB2 pureScale: Technology Overview
Leverages IBM's System z Sysplex experience and know-how
• Clients connect anywhere, see a single database
– Clients connect into any member
– Automatic load balancing and client reroute
• DB2 engine runs on several hosts (members)
– Members cooperate with each other to provide coherent access to the database from any member
• Integrated cluster services
– Failure detection, recovery automation
• Low-latency, high-speed interconnect
– Special optimizations provide significant advantages on RDMA-capable interconnects (e.g. InfiniBand)
• Cluster caching facility (CF)
– Efficient global locking and buffer management
– Synchronous duplexing to a secondary CF for availability
• Data sharing architecture
– Shared access to the database
– Members write to their own logs
– Logs accessible from another host
(Diagram: clients see a single database view across members; members connect over the cluster interconnect to the primary and secondary CFs and to shared storage holding the logs and database)
What Happens in DB2 pureScale to Read a Page
Agent on Member 1 wants to read page 501:
1. db2agent checks the local buffer pool: page not found
2. db2agent performs a Read and Register (RaR) RDMA call directly into CF memory
– No context switching, no kernel calls
– Synchronous request to the CF
3. The CF replies that it does not have the page (again via RDMA)
4. db2agent reads the page from disk
(Diagram: Member 1's db2agent requests page 501 from the group buffer pool in the PowerHA pureScale CF, then falls back to reading it from disk)


The Advantage of DB2 Read and Register with RDMA
1. The DB2 agent on Member 1 writes directly into CF memory with:
– The page number it wants to read
– The buffer pool slot it wants the page to go into
2. The CF responds by writing directly into memory on Member 1:
– That it does not have the page, or
– With the requested page of data
• Total end-to-end time for RaR is measured in microseconds
• Calls are so fast the agent may even stay on the CPU for the response
(Diagram: db2agent issues a direct remote memory write with the request – "I want page 501, put it into slot 42 of my buffer pool" – and a CF thread answers with a direct remote memory write of the response – "I don't have it, get it from disk")
Much more scalable, does not require locality of data


Workload Balancing
• Run-time load information is used to automatically balance load across the cluster
– Shares its design with the System z Sysplex
– Load information for all members is kept on each member
– Piggy-backed to clients regularly
– Used to route the next connection (or optionally the next transaction) to the least loaded member
– Routing occurs automatically (transparent to the application)


Online Recovery
• The DB2 pureScale design point is to maximize availability during failure recovery processing
• When a database member fails, only in-flight data remains locked until member recovery completes
– In-flight = data being updated on the failed member at the time it failed
• Target time to row availability: <20 seconds
(Diagram: on a database member failure, only data with in-flight updates is locked during recovery; the percentage of data available returns to 100% within seconds)


Summary (Single Failures)
(Table: for each single failure mode – member, primary CF, secondary CF – whether the other members remain online and whether recovery is automatic and transparent; connections to a failed member transparently move to another member)


Simultaneous Failures
(Table: for each simultaneous failure mode – a member failing together with a CF – whether the other members remain online and whether recovery is automatic and transparent; in each case, connections to the failed member transparently move to another member)


DB2 pureScale – No Freeze at All
(Diagram: Member 1 fails; the CF, acting as central lock manager, always knows what changes are in flight, so there is no I/O freeze)
• The CF knows what rows on these pages had in-flight updates at the time of failure


Q Replication
(Diagram: a log-based Capture program on the source sends transactions over WebSphere MQ to an Apply program on the target; control tables and admin utilities manage both ends)
• Each message represents a transaction
• Highly parallel apply process
• Differentiated conflict detection and resolution
• Integrated infrastructure for replication and publishing


Continuous Availability Using Q Replication
(Diagram: Q Capture and Q Apply run on both the primary database (PROD) and the secondary database (STBY); read/write applications use the primary connection, read-only applications run on the secondary, and the secondary connection is available for failover)
Q Replication provides a solution for continuous availability where the active secondary system is also available for other applications


HA Scenarios



Local Cluster Failover
DB2 automation with the built-in cluster manager
(Diagram: an HA cluster with a primary database and a local standby database sharing DB1)
Pros:
• Inexpensive local failover solution
• Protection from software and server failure
• DB2 9.5 integrated TSA cluster manager
Cons:
• No protection from disk failure
• No protection from site failure
• Failover times vary from 1 to 5 minutes
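Cluster automation with the integrated TSA manager is typically configured with the db2haicu utility. A hedged sketch; the XML input file and its name are placeholders:

```shell
# Interactive setup of the TSA cluster domain for this instance
db2haicu

# Or drive it non-interactively from a prepared XML configuration
db2haicu -f ha_config.xml
```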



HADR Local or Remote with Read on Standby
(Diagram: the primary connection goes to the primary database (DB1) in an HADR cluster; the standby database (DB1a) is open for read-only work)
Pros:
• Inexpensive local failover or DR solution
• Protection from software, server, storage, and site failures
• Simple to set up and monitor
• Failover time in the range of 30 sec
• Reporting on standby without an increase in failover time
Cons:
• Two full copies of the database (a plus from a redundancy perspective)
• Only read transactions can run on standby
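HADR state can be checked on either server with db2pd; a sketch, using the database name DB1 from the diagram:

```shell
# Shows role, HADR state, sync mode, connection status, and log positions/gap
db2pd -db DB1 -hadr
```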
HADR With Disk Mirroring to a Remote DR Site
(Diagram: an HADR cluster with automatic client reroute between the primary database (DB1) and a local standby (DB1a); remote disk mirror technology keeps a third copy (DB1aa) at a disaster recovery site)
Pros:
• Very fast local failover with DR capability
• Protection from software, server, storage, and site failures
• Local failover time in the range of 30 seconds
Cons:
• Three full copies of the database (a plus from a redundancy perspective)
• More costly than HADR for just DR


HADR With Multiple Standbys in DB2 10
(Diagram: an HADR cluster with automatic client reroute between the primary (DB1) and a local standby (DB1a), plus a remote standby (DB1aa) at a disaster recovery site)
Pros:
• Very fast local failover with DR capability
• Protection from software, server, storage, and site failures
• Allows for a time delay on auxiliary standbys
• Local failover time in the range of 30 seconds
Cons:
• Three full copies of the database (a plus from a redundancy perspective)


Q Replication
(Diagram: the primary connection goes to the primary database (DB1) at Site A; Q-based SQL replication feeds multiple alternate logical standbys (DB2 at Site B, DB3 at Site C), which are open for read/write)
Pros:
• Protection from software, server, storage, and site failures
• Failover time is "instant"
• Standby can be full or a subset and is fully accessible (read and/or write)
• Multiple standby servers
Cons:
• More complex to set up and monitor (but more flexible) vs. HADR
• Asynchronous
High Availability Disaster Recovery (HADR) Options
• Local HA – fast failover with server and storage protection
(Diagram: Server A as primary and Server B as standby, both at Site A)
• Disaster recovery – server, storage, and site protection
(Diagram: Server A as primary at Site A and Server B as standby at Site B, any distance apart)
• Both – fast local failover with server, storage, and site protection
(Diagram: Server A as primary and Server B as standby at Site A, plus Server C as standby at Site B, any distance apart)
HADR with Replication
• HADR pairs with replication
(Diagram: an HADR pair – Server A primary, Server B standby – at Site A, replicating over any distance to a second HADR pair at Site B – Server C active, Server D standby)
• Delivers:
– Fast local failover
– Active/active DR
– Rolling patch upgrades
– Rolling version upgrades
– Online database on-disk modifications
– Schema modifications online/rolling


DB2 pureScale Availability Options
• Local only
– Online recovery, active-active, protection from server failure
(Diagram: Members 1 and 2 with duplexed CFs at Site A)
• Geographically dispersed cluster
– Online recovery, active-active, protection from server, storage, and site failure
(Diagram: Members 1 and 2 with a CF at Site A, Members 3 and 4 with a CF at Site B, sites < 100 km apart)
• Local pureScale plus DR replication
– Online recovery, active-active, protection from server, storage, and site failure
(Diagram: a two-member cluster with duplexed CFs at Site A, replicated over any distance – via disk replication or IBM Replication Server – to a matching cluster at Site B)
Summary
• One size does not fit all
• There are many availability scenarios, each with its own advantages
– Server failover
– HADR
– pureScale
– Q Replication
• Choose the one that best suits your deployment
– Determine the right solution considering:
• Cost (hardware/software/network/site)
• Availability requirements
• Management costs
• Application requirements



Thank You
You can reach me at tnarayan@in.ibm.com

