Microsoft Cluster Server (formerly Wolfpack)
Joe Barrera, Jim Gray, Microsoft Research
{joebar, gray}@microsoft.com, http://research.microsoft.com/barc
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
System Availability:
Availability = MTTF / (MTTF + MTTR)
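The availability formula, Availability = MTTF / (MTTF + MTTR), can be computed directly. A minimal sketch (the 10-week MTTF and 90-minute repair time are the figures from the survey on the next slide):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# The survey's ~10-week MTTF and ~90-minute average outage:
a = availability(10 * 7 * 24, 1.5)
print(f"{a:.5f}")   # 0.99911, about "three nines"
```

Note how strongly MTTR matters: halving repair time buys almost as much availability as doubling MTTF.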
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe).
Outage causes: Vendor (hardware and software) 42%; Application software; Communications lines; Operations; Environment.
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average outage duration ~ 90 MINUTES.
[Chart: outage causes over time; legend: unknown, environment, operations, maintenance, hardware, software.]
MTTF improved. Causes shifted from Hardware & Maintenance (from 50% down to 10%) to Software (62%) & Operations (15%). NOTE: systematic under-reporting of Environment, Operations errors, and Application Software.
Summary of FT Studies
Current Situation: ~4-year MTTF =>
Fault Tolerance Works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults.
Many hidden software outages in operations: new software, utilities. Must make all software ONLINE. Software seems to define a 30-year MTTF ceiling.
Reasonable Goal:
Transactions for reliability; Clusters for availability; Security. All built into the OS.
Cluster Goals
Manageability: manage nodes as a single system; perform server maintenance without affecting users; mask faults, so repair is non-disruptive.
Availability: restart failed applications & servers.
Scalability: add nodes for incremental growth in processing and storage.
Fault Model
Hardware fails fast (blue-screen). Software fails fast (or goes to sleep). Software is often repaired by reboot: Heisenbugs.
Cluster: a collection of nodes that act together as a single system. Clients see scalable & FT services (single system image). Node: a server in a cluster; may be an SMP server. Interconnect: communications link used for intra-cluster status info such as heartbeats; can be Ethernet.
[Diagram: client PCs and printers attached to a two-node server cluster.]
Failover Example
[Diagram: a browser's requests go to Server 1 (Web site + Database); on failure, the Web site and Database move to Server 2, which serves the same Web site files and database files.]
Resource States
- Online
- Offline
- Pending
- Partial
- Failed
Demo Configuration
Server Betty and Server Alice, each: SMP Pentium Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server.
[Diagram: each server has local disks; both attach to a shared SCSI disk cabinet; the interconnect is standard Ethernet.]
Administrator client: a Windows NT Workstation running Cluster Admin and SQL Enterprise Mgr.
Demo Administration
Server Alice: runs SQL Trace, runs Globe.
Server Betty: runs SQL Trace.
[Both servers have local disks.]
Windows GUI: shows cluster resource status; replicates status to all servers. © 1996, 1997 Microsoft Corp.
Mplay32 is a generic app: registered with MSCS; MSCS restarts it on failure; move/restart takes ~ 2 seconds; fail-over after 4 failures.
Notepad saves state on the shared disk: failure before save => lost changes; on failover or move, the disk and state move with the group.
[Animation: Web service failover. A browser's HTTP request reaches IIS, which talks through ODBC to SQL Server; the databases live on the shared SCSI disk cabinet, and each node also has local disks. When the serving node fails, the IP address, IIS, and SQL Server resources move to the surviving node, which takes over the shared disks and continues serving the same requests.]
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Cluster Abstractions
Basic NT Abstractions: Domain, Node, Service.
Cluster Abstractions: Cluster, Resource Group, Resource.
Resources
[Diagram: a Cluster contains Groups; a Group contains Resources.]
Resources have...
- Type: what it does (file, DB, print, web)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing Resource Group
- Dependencies on other resources
- Restart parameters (in case of resource failure)
Resource Types
Built-in types: Generic Application; Generic Service; Internet Information Server (IIS) Virtual Root; Network Name; TCP/IP Address; Physical Disk; FT Disk (software RAID); Print Spooler; File Share.
Added by others: Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3. Your application? (Use the developer kit wizard.)
[Screenshots: property sheets for the built-in resource types: Physical Disk, TCP/IP Address, Network Name, File Share, Print Spooler.]
Resource States
- Offline: exists, not offering service
- Online: offering service
- Failed: not able to offer service
[State diagram: Offline, on "Go Online!", enters Online Pending, then ("I'm Online!") Online; Online, on "Go Offline!", enters Offline Pending, then ("I'm Offline!") Offline; a resource that cannot come online enters Failed.]
Resource Dependencies
A resource is brought online only after the resources it depends on are online; it is taken offline before the resources it depends on. Dependencies form trees; interdependent resources (e.g., File Share, IIS Virtual Root, Network Name) move among nodes together and fail over together, as part of their resource group.
[Screenshot: the Dependencies tab.]
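The online/offline ordering rule above is a topological sort of the dependency tree. A minimal sketch (the resource names and dependency shape are illustrative, and cycles are assumed not to occur, since dependencies form trees):

```python
def online_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return an order for bringing resources online: every resource
    appears after everything it depends on (offline order is the reverse)."""
    order, seen = [], set()

    def visit(r):
        if r in seen:
            return
        seen.add(r)
        for dep in depends_on.get(r, []):   # dependencies come online first
            visit(dep)
        order.append(r)

    for r in depends_on:
        visit(r)
    return order

# A plausible tree over the resource types named on the slide:
deps = {
    "IIS Virtual Root": ["Network Name", "File Share"],
    "File Share": ["Physical Disk"],
    "Network Name": [],
    "Physical Disk": [],
}
print(online_order(deps))
# → ['Network Name', 'Physical Disk', 'File Share', 'IIS Virtual Root']
```

Because dependencies never cross groups, this sort can run per group, which is what lets a group fail over as a unit.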
NT Registry
Hierarchical (name, value) map. Has an open, documented interface. Is secure. Is visible across the net (RPC interface). Typical entry:
\Software\Microsoft\MSSQLServer\MSSQLServer\ DefaultLogin = GUEST DefaultDomain = REDMOND
Cluster Registry
Stable storage Refreshed from master copy when node joins cluster
Per-resource properties (all kept in the Cluster Registry):
- Name
- Restart policy (restart N times, then fail over)
- Startup parameters
- Private configuration info (resource-type specific; per-node as well, if necessary)
- Poll intervals (LooksAlive, IsAlive, Timeout)
Resource Groups
[Diagram: a Cluster contains Groups; a Group contains Resources.]
Payroll Group
[Diagram: a Web Server and SQL Server depending on an IP Address, Drive E:, and Drive F:.]
Dependencies NEVER cross groups (dependency trees are contained within a group). A group may contain a forest of dependency trees.
Group Properties
- CurrentState: Online, Partially Online, Offline
- Members: resources that belong to the group
- Preferred Owners: ordered list of host nodes
- FailoverThreshold: how many faults cause failover
- FailoverPeriod: time window for the failover threshold
- FailbackWindowStart / FailbackWindowEnd: when can failback happen?
Everything (except CurrentState) is stored in the registry.
Failover parameters
timeout on LooksAlive, IsAlive # local restarts in failure window after this, offline. (during failback window)
Failover
Cluster Cluster Failback Service Service
IPaddr name
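The restart-then-failover policy above can be sketched as a sliding window of failure timestamps. This is an illustration of the rule, not the MSCS implementation; the class name and the 3-per-15-minutes numbers are made up:

```python
class RestartPolicy:
    """Restart a failed resource locally up to `threshold` times within
    `window` seconds; past that, hand the group to the Failover Manager."""

    def __init__(self, threshold: int, window: float):
        self.threshold = threshold
        self.window = window
        self.failures: list[float] = []

    def on_failure(self, now: float) -> str:
        # Keep only failures that are still inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        return "restart" if len(self.failures) <= self.threshold else "failover"

p = RestartPolicy(threshold=3, window=900)          # 3 restarts per 15 minutes
print([p.on_failure(t) for t in (0, 10, 20, 30)])
# → ['restart', 'restart', 'restart', 'failover']
```

Old failures age out of the window, so an occasional Heisenbug gets a cheap local restart while a crash loop escalates to failover.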
Cluster Properties
Defined Members: nodes that can join the cluster.
Active Members: nodes currently joined to the cluster.
Resource Groups: the groups in the cluster.
Quorum Resource: stores the master copy of the cluster registry; used to form a quorum.
Find and communicate with the cluster; query/set cluster properties; enumerate cluster objects (nodes, groups, resources and resource types); receive node, group, and resource state and property changes.
Cluster Management
Demo
Server startup and shutdown Installing applications Changing status Failing over Transferring ownership of groups or resources Deleting Groups and Resources
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Architecture
Top tier provides cluster abstractions Middle tier provides distributed operations Bottom tier is NT and drivers
[Architecture diagram. Top tier: Failover Manager, Resource Monitor, Cluster Registry. Middle tier: Global Update, Quorum, Membership. Bottom tier: Windows NT Server, Cluster Disk Driver, Cluster Net Drivers.]
Membership: used to track which nodes are active (joined) members of the cluster.
Regroup: used for failure detection (via heartbeat messages) and for forceful eviction from the set of active nodes maintained by Membership.
Quorum: a subset of the defined cluster that includes the Quorum Resource and is stable (no regroup in progress).
Quorum Resource: arbitrated for whenever any defined member is missing (not active). Prevents configuration partitions in time.
Challenge/Defense Protocol
The owner holds a lease on the semaphore (a SCSI reservation) and renews it once every 3 seconds. To preempt ownership:
- The challenger clears the semaphore (SCSI bus reset) and waits 10 seconds: (3 seconds for renewal + 2 seconds bus settle time) x 2, to give the owner two chances to renew.
- If the semaphore is still clear, the former owner loses the lease and the challenger issues a reserve to acquire it.
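The arbitration outcome can be sketched as a toy model. The 3-second renewals and 10-second wait are collapsed into a single flag saying whether the defender was alive to renew during the wait; the class and node names are illustrative, not MSCS code:

```python
class QuorumDisk:
    """Toy model of SCSI challenge/defense for the quorum semaphore."""

    def __init__(self):
        self.owner = None                 # who holds the SCSI reservation

    def challenge(self, challenger: str, defender_alive: bool) -> str:
        self.owner = None                 # bus reset clears the semaphore
        if defender_alive:
            self.owner = "defender"       # a live owner renews within the 10 s wait
        if self.owner is None:            # still clear: the lease is forfeit
            self.owner = challenger       # challenger issues its reserve
        return self.owner

disk = QuorumDisk()
print(disk.challenge("challenger", defender_alive=True))   # → defender
print(disk.challenge("challenger", defender_alive=False))  # → challenger
```

The point of the doubled wait is safety: a live owner gets two renewal chances, so a slow-but-healthy owner is never preempted by a single missed renewal.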
[Timeline: a failed challenge. The owner renews its reservation during the challenger's 10-second wait after the bus reset; the challenger detects the reservation, and the owner keeps the semaphore.]
[Timeline: a successful challenge. The former owner has failed; the semaphore is still clear after the 10-second wait, and the challenger acquires the reservation.]
Regroup
Invariant: all members agree on the member set. Regroup re-computes the member set. Each node sends a heartbeat message to a peer (default: one per second); regroup starts after two lost heartbeat messages.
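The two-missed-heartbeats rule gives failure detection in bounded time. A minimal sketch of the check (the function name and timestamp representation are illustrative):

```python
def needs_regroup(heartbeat_times: list[float], now: float,
                  period: float = 1.0, missed_limit: int = 2) -> bool:
    """True when `missed_limit` consecutive heartbeats (one per `period`
    seconds) have been missed since the peer's last heartbeat."""
    last = max(heartbeat_times) if heartbeat_times else float("-inf")
    return now - last >= missed_limit * period

print(needs_regroup([0.0, 1.0, 2.0], now=3.1))  # → False (one beat missed)
print(needs_regroup([0.0, 1.0, 2.0], now=4.2))  # → True (two beats missed)
```

With a 1-second period and a limit of two, suspicion is raised within about 2 seconds of a crash, after which regroup checks communication among all nodes; a suspected node that can still communicate survives the regroup.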
Two lost heartbeats raise suspicion that the sender is dead, giving failure detection in bounded time. Regroup then checks communication among the nodes; a suspected-missing node may survive.
[Node state diagram: Joining leads to Online when the join and synchronize steps succeed; a lost heartbeat while Online triggers Regroup.]
Joining a Cluster
When a node starts up, it mounts and configures only local, non-cluster devices, then starts the Cluster Service, which:
- looks in the local (stale) registry for cluster members
- asks each member in turn to sponsor the new node's join
Forming a cluster (if no member answers):
- use the registry to find the quorum resource
- attach to (arbitrate for) the quorum resource
- update the cluster registry from the quorum resource (e.g., if we were down when it was last updated)
- form a new one-node cluster, bring other cluster resources online, and let others join
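The join-else-form decision above can be sketched as follows. The member lists, reachability set, and return strings stand in for the RPC calls and SCSI arbitration the slides describe; none of this is MSCS code:

```python
def start_node(node: str, stale_members: list[str],
               reachable: set[str], quorum_free: bool) -> str:
    """Try to join via a sponsor; otherwise arbitrate for the quorum
    resource and form a new one-node cluster."""
    for sponsor in stale_members:            # ask each known member in turn
        if sponsor != node and sponsor in reachable:
            return f"joined via {sponsor}"
    if quorum_free:                          # nobody answered: try to form
        return "formed one-node cluster"
    return "cannot form: quorum resource held elsewhere"

print(start_node("betty", ["alice", "betty"], reachable={"alice"},
                 quorum_free=False))         # → joined via alice
```

The quorum arbitration in the last step is what prevents two isolated nodes from each forming their own cluster over the same shared disks.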
Pause:
Move all groups off this member; change to the paused state (it remains a cluster member).
Offline:
Move all groups off this member; send a ClusterExit message to all cluster members.
Non-Minority rule (which side of a partition survives):
- A minority group, OR a group without the quorum device, does NOT survive.
- A non-minority group (number of new members >= 1/2 of the old active cluster) with the quorum device DOES survive.
This prevents a minority from seizing the quorum device at the expense of a larger, potentially surviving cluster.
Global Update
Propagates updates to all nodes in the cluster; used to maintain the replicated cluster registry. Updates are atomic and totally ordered. Tolerates all benign failures. Depends on Membership.
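The delivery rule that makes global update recoverable is the seniority order: the most senior node (the "locker") is updated first, and the sender includes itself. A sketch of that rule only; the function, node names, and dict-as-registry are illustrative stand-ins for the real protocol's locking and acknowledgment messages:

```python
def global_update(members_by_seniority: list[str], sender: str,
                  registries: dict[str, dict], key: str, value) -> None:
    """Apply one update at every active node in seniority order, the
    locker (most senior) first and including the sender, so a surviving
    senior node always knows whether the update started."""
    assert sender in members_by_seniority
    for node in members_by_seniority:   # seniority order: locker first
        registries[node][key] = value   # an ack would be collected here

regs = {n: {} for n in ("alice", "betty", "carol")}
global_update(["alice", "betty", "carol"], "carol", regs, "X", 1)
print(all(r == {"X": 1} for r in regs.values()))  # → True: replicas agree
```

If the sender dies mid-update, the new locker (the oldest survivor) either never saw the update (so no one did, and partially-updated nodes roll back) or has it and can restart it, which is exactly the recovery argument on the next slide.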
R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and global update protocols.
[Diagram: the updating node acquires the lock from the locker node, then sends the update (e.g., X=1) to every node in seniority order (locker first, including the updating node itself), collecting an ack from each.]
Failure cases: either the update never happened (updated nodes will roll back on recovery), or the new locker, being the oldest node, has the update if any node does, and restarts it.
Cluster Registry
Bootstrapping: the cluster registry is maintained by global update, which depends on membership, yet membership configuration lives in the registry.
Solution: Membership uses the stale local cluster registry, refreshed after joining or forming the cluster. The master copy comes from either the quorum device or the active members.
Resource Monitor
Polls resources: LooksAlive (quick check) and IsAlive (thorough check).
Detects failures via a polling failure or a failure event raised by the resource.
Failover Manager
Inputs: failover parameters, the possible nodes for each resource in the group, and the preferred nodes for the resource group.
[Flowchart: on a resource failure, restart locally while the restart threshold allows; otherwise switch the resource (and its dependents) offline and ask whether another owner can be found (arbitration).]
The Failover Manager performs Arbitration to locate a new owner for the group; if one is found, the Failover Manager on the new owner node brings the resources Online.
Node failover:
- The Resource Manager notifies the Failover Manager that the node is Offline and the groups it owned need to fail over.
- The Failover Manager determines which groups were owned by the failed node.
- The Failover Manager performs Arbitration to locate a new owner for the groups.
- The Failover Manager on the new owner(s) brings the resources Online in dependency order.
A group may have a Preferred Owner. Failback occurs when the Preferred Owner comes back online, and only during the Failback Window (a time slot, e.g., at night):
- The Preferred Owner comes back Online.
- The Resource Manager takes each resource on the current owner Offline.
- The Resource Manager notifies the Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
- The Failover Manager performs Arbitration to locate the Preferred Owner of the group.
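The failback-window test implied here is a small predicate. A sketch, assuming hour-granularity times and a window that may wrap past midnight (the 22:00-04:00 example window is made up):

```python
def can_failback(hour: int, window_start: int, window_end: int,
                 preferred_owner_online: bool) -> bool:
    """Fail back only when the preferred owner is online and the clock
    is inside the failback window."""
    if not preferred_owner_online:
        return False
    if window_start < window_end:                      # same-day window
        return window_start <= hour < window_end
    return hour >= window_start or hour < window_end   # wraps midnight

print(can_failback(23, 22, 4, True))   # → True: inside the overnight window
print(can_failback(12, 22, 4, True))   # → False: wait for the window
```

Deferring failback to a quiet window is the point: failback is a second, deliberate outage, so it is scheduled rather than immediate.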
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Process Structure
[Diagram: on each node, the Cluster Service hosts the Failover Manager, Cluster Registry, Global Update, Quorum, and Membership components. Separate Resource Monitor processes make private calls into Resource DLLs, which control the resources themselves (services and applications).]
Resource Control
[Diagram: the Cluster Service sends commands to a Resource Monitor on the node; the monitor makes private calls into the resource's DLL, which in turn controls the resource.]
Resource DLLs
Entry points:
- Open: get a handle on the resource
- Online: start offering the service
- Offline: stop offering the service (graceful)
- LooksAlive: quick check
- IsAlive: thorough check
- Terminate: forceful Offline
- Close: release the handle
[State diagram, as for resources: Offline, Online Pending, Online, Offline Pending, and Failed; the Resource Monitor drives the DLL through these calls.]
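The entry-point contract (Open, Online, Offline, LooksAlive, IsAlive, Terminate, Close) can be mocked up as a class. A real MSCS resource DLL exports these as C functions; this Python sketch only shows the shape the Resource Monitor expects, and the class and share names are invented:

```python
class FileShareResource:
    """Mock of a resource DLL's entry points for a file-share resource."""

    def open(self, name: str):           # Open: get a handle
        self.name, self.running = name, False
        return self

    def bring_online(self):              # Online: start offering service
        self.running = True

    def take_offline(self):              # Offline: graceful stop
        self.running = False

    def looks_alive(self) -> bool:       # LooksAlive: quick, frequent poll
        return self.running

    def is_alive(self) -> bool:          # IsAlive: thorough, rare poll
        return self.running              # a real DLL would probe the share

    def terminate(self):                 # Terminate: forceful offline
        self.running = False

    def close(self):                     # Close: release the handle
        self.name = None

r = FileShareResource().open("PayrollShare")   # "PayrollShare" is made up
r.bring_online()
print(r.looks_alive(), r.is_alive())           # → True True
```

The split between a cheap LooksAlive and an expensive IsAlive is the design point: the monitor can poll frequently without making health checking itself a load on the resource.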
Cluster Communications
Most communication is via DCOM/RPC; UDP is used for the membership heartbeat messages; standard (e.g., Ethernet) interconnects.
[Diagram: management apps talk DCOM to the Cluster Service on each node; each Cluster Service talks DCOM/RPC to its Resource Monitors; between nodes, DCOM/RPC carries administration and UDP carries heartbeats.]
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Virtual Servers
Problem: client and server applications do not want the node name to change when a server app moves to another node.
Solution, the virtual server, a Resource Group containing: the name, disks, and databases; a NetName and IP address (node \\a keeps its name and IP address as it moves); a virtual registry (the registry is replicated and moves); virtual service control; a virtual RPC service.
Challenges: limit the app to the virtual server's devices and services; clients must reconnect on failover (easy if connectionless, e.g., web clients).
Nodes \\Y and \\Z support virtual servers \\A and \\B. Things that need to fail over transparently: the client connection, server dependencies, service names, bindings to local resources, and bindings to local servers.
[Diagram: SAP on \\A and SAP on \\B, hosted by nodes \\Y and \\Z.]
\\Y's resources and groups (i.e., virtual server \\A) have moved to \\Z. \\A's resources bind to each other and to local resources (e.g., the local file system). Transactions are used to make the DB state consistent; for this to work, the local resources on \\Y and \\Z have to be similar.
[Diagram: \\Z now hosts both \\A (SAP, SQL, S:\) and \\B (SAP, SQL, T:\).]
Clients reconnect to \\A after failover (details later). The failure is not transparent: users must log on again, and client context is lost (which encourages connectionless designs); applications could maintain context themselves.
[Diagram: clients reconnecting to virtual server \\A, now hosted alongside \\B on \\Z.]
Problem: \\A\SAP and \\B\SAP would both refer to the local \\.\SQL. Solution: \\A\SAP refers to \\.\SQL$A and \\B\SAP to \\.\SQL$B, so each virtual server binds to its own service instance.
[Diagram: nodes \\Y and \\Z each hosting SAP and SQL for virtual servers \\A and \\B.]
Services rely on the NT node name and/or IP address to advertise shares, printers, and services. Applications register names to advertise services, e.g., \\Alice\SQL (i.e., <node><service>) or 128.2.2.2:80 (= http://www.foo.com/). Binding: clients bind to an address (e.g., name -> IP address). Thus the node name and IP address must fail over along with the services, to preserve client bindings.
The IP address rebinds to the failover node's MAC address, transparently to client and server: low-level ARP (Address Resolution Protocol) rebinds the IP address to the new MAC address.
Example addresses: Alice <-> 200.110.12.4; Virtual Alice <-> 200.110.12.5; Betty <-> 200.110.12.6; Virtual Betty <-> 200.110.12.7.
Cluster clients must use IP (TCP, UDP, NBT, ...) and must reconnect or retry after a failure. All cluster nodes must be on the same LAN segment.
[Diagram: a WAN router on the local network mapping 200.110.120.4 and 200.110.120.5 -> AliceMAC, 200.110.120.6 and 200.110.120.7 -> BettyMAC.]
Time
Cluster nodes agree on time; otherwise applications get confused (e.g., make/nmake/build). Not hard, since failover is on the order of seconds. Time is a resource, so one node owns the time resource; other nodes periodically correct drift from the owner's time.
Registry Replication
Resources can request that local NT registry subtrees be replicated; changes are written out to the quorum device. Uses the registry change-notification interface.
[Diagram: \\A's registry subtree, available on \\B after failover via the quorum device.]
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Generic resource DLLs map resource calls onto NT service calls:
- Online => Service Start
- Offline => Service Stop
- LooksAlive/IsAlive => Service Status
[Diagram: the Resource Monitor makes standard calls into the generic DLL, which makes private calls to the service.]
[Screenshots: property sheets for the Generic Application and Generic Service resource types.]
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
The wizard asks for the resource type name, an optional service to control, and other parameters (with their associated types); it then generates DLL source code, which can be modified as necessary (e.g., additional checks for LooksAlive/IsAlive).
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Cluster API
Specs & API became public in Sept 1996: on all MSDN Level 3, and on the web site:
http://www.microsoft.com/clustering.htm
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Research Topics?
- Even easier management
- Transparent failover
- Instant failover
- Geographic distribution (disaster tolerance)
- Server pools (load-balanced pool of processes)
- Process pairs (active/backup process)
- 10,000 nodes?
- Better algorithms
- Shared memory or shared disk among nodes, or is that a truly bad idea?
Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site: http://research.microsoft.com/BARC
These slides: http://research.microsoft.com/~joebar/ftcs-27/ftcs20.ppt

References
- Custer, H., Inside Windows NT, Microsoft Press, ISBN 155615481.
- Carr, R., Tandem Systems Review, V1.2, 1985. Sketches the regroup and global update protocols.
- Kronenberg, N., Levy, H., Strecker, W., "VAXclusters: A Closely-Coupled Distributed System," ACM TOCS, V4.2, 1986. A (the) shared-disk cluster.
- Pfister, G., In Search of Clusters, Prentice Hall, 1995, ISBN 0134376250. Argues for shared-nothing.
- Gray, J., Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.