
FT NT: A Tutorial on Microsoft Cluster Server

(formerly Wolfpack)
Joe Barrera, Jim Gray
Microsoft Research
{joebar, gray}@microsoft.com
http://research.microsoft.com/barc

1996, 1997 Microsoft Corp.

Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

DEPENDABILITY: The 3 ITIES

RELIABILITY / INTEGRITY: Does the right thing. (also large MTTF)
AVAILABILITY: Does it now. (also small MTTR / (MTTF + MTTR))
System Availability:
- If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time).
Holistic vs. Reductionist view

[Diagram: Integrity, Security, Reliability, Availability]
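The availability arithmetic on this slide can be sketched in a few lines of Python. The function names are illustrative assumptions; only the 90% and 99% figures come from the slide.

```python
def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def unavailability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def serial_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up at once."""
    result = 1.0
    for a in parts:
        result *= a
    return result


# The slide's example: 90% of terminals up and 99% of the database up.
end_to_end = serial_availability(0.90, 0.99)
print(f"{end_to_end:.1%}")  # 89.1%, the slide's "89%"
```

Multiplying the two figures assumes that terminal and database outages are independent, which matches the independence assumption the talk makes for its fault model.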

Case Study - Japan

"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).

Cause                            Share of outages   MTTF
Vendor (hardware and software)   42%                5 months
Application software             25%                9 months
Communications lines             12%                1.5 years
Operations                       9.3%               2 years
Environment                      11.2%              2 years
All causes                                          10 weeks

1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES.
To Get 10 Year MTTF, Must Attack All These Areas

Case Studies - Tandem Trends

[Charts: "Outages/1000 System Years by Primary Cause" and "% of Outages by Primary Cause" for 1985, 1987, 1989; causes: unknown, environment, operations, maintenance, hardware, software]

MTTF improved.
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%).
NOTE: Systematic under-reporting of:
- Environment
- Operations errors
- Application Software

Summary of FT Studies

Current Situation: ~4-year MTTF => Fault Tolerance Works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
- New Software.
- Utilities.
Must make all software ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable Goal: 100-year MTTF.
- class 4 today => class 6 tomorrow.

Fault Tolerance vs Disaster Tolerance

Fault-Tolerance: masks local faults
- RAID disks
- Uninterruptible Power Supplies
- Cluster Failover
Disaster Tolerance: masks site failures
- Protects against fire, flood, sabotage, ...
- Redundant system and service at remote site.

The Microsoft Vision: Plug & Play Dependability

Transactions: for reliability
Clusters: for availability
Security
All built into the OS

[Diagram: Integrity, Security, Integrity / Reliability, Availability]

Cluster Goals

Manageability
- Manage nodes as a single system
- Perform server maintenance without affecting users
- Mask faults, so repair is non-disruptive
Availability
- Restart failed applications & servers
- un-availability ~ MTTR / MTBF, so quick repair.
- Detect/warn administrators of failures
Scalability
- Add nodes for incremental processing, storage, bandwidth

Fault Model

Failures are independent
- So, single fault tolerance is a big win
Hardware fails fast (blue-screen)
Software fails-fast (or goes to sleep)
Software often repaired by reboot:
- Heisenbugs
Operations tasks: major source of outage
- Utility operations
- Software upgrades

Cluster: Servers Combined to Improve Availability & Scalability

Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).
Node: A server in a cluster. May be an SMP server.
Interconnect: Communications link used for intracluster status info such as "heartbeats". Can be Ethernet.

[Diagram: Client PCs and Printers; Server A with Disk array A and Server B with Disk array B, joined by the Interconnect]

Microsoft Cluster Server

2-node availability Summer 97 (20,000 Beta Testers now)
- Commoditize fault-tolerance (high availability)
- Commodity hardware (no special hardware)
- Easy to set up and manage
- Lots of applications work out of the box.
16-node scalability later (next year?)

Failover Example

[Diagram: a Browser connected to Server 1 and Server 2, each running Web site and Database, with shared Web site files and Database files]

MS Press Failover Demo

Client/Server
Software failure
Admin shutdown
Server failure

Resource States
- Pending
- Partial
- Failed
- Offline

Demo Configuration

Server Alice and Server Betty (Windows NT Server Cluster), each with:
- SMP Pentium Pro Processors
- Windows NT Server with Wolfpack
- Microsoft Internet Information Server
- Microsoft SQL Server
- Local Disks
Interconnect: standard Ethernet
SCSI Disk Cabinet: Shared Disks
Administrator:
- Windows NT Workstation
- Cluster Admin
- SQL Enterprise Mgr
Client:
- Windows NT Workstation
- Internet Explorer
- MS Press OLTP app

Demo Administration

Server Alice: Runs SQL Trace, Runs Globe
Server Betty: Runs SQL Trace
Cluster Admin Console
- Windows GUI
- Shows cluster resource status
- Replicates status to all servers
SQL Enterprise Mgr
- Windows GUI
- Shows server status

[Diagram: Windows NT Server Cluster with Local Disks on each server, a SCSI Disk Cabinet with Shared Disks, and a Client]

Generic Stateless Application: Rotating Globe

Mplay32 is generic app.
Registered with MSCS
MSCS restarts it on failure
Move/restart ~ 2 seconds
Fail-over if 4 failures (= process exits) in 3 minutes
- settable default
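The "4 failures in 3 minutes" rule reads like a sliding-window threshold. A minimal sketch, assuming that the failure which reaches the threshold is the one that triggers fail-over (the slide does not say exactly when the switch happens), with a hypothetical `RestartPolicy` class:

```python
from collections import deque


class RestartPolicy:
    """Restart locally until too many failures land inside one time window."""

    def __init__(self, threshold: int = 4, period_s: float = 180.0):
        self.threshold = threshold      # failures that trigger fail-over
        self.period_s = period_s        # sliding window length in seconds
        self._failures = deque()        # timestamps of recent failures

    def on_failure(self, now_s: float) -> str:
        """Record a failure at time now_s and return 'restart' or 'failover'."""
        self._failures.append(now_s)
        # Forget failures that have aged out of the window.
        while self._failures and now_s - self._failures[0] > self.period_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._failures.clear()
            return "failover"
        return "restart"


policy = RestartPolicy()
decisions = [policy.on_failure(t) for t in (0, 30, 60, 90)]
print(decisions)  # ['restart', 'restart', 'restart', 'failover']
```

Failures spread further apart than the window never accumulate, so a resource that crashes once an hour keeps being restarted in place.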

Demo: Moving or Failing Over An Application

[Diagram: AVI Application on each server with Local Disks; SCSI Disk Cabinet with Shared Disks; Windows NT Server Cluster]

Alice Fails or Operator Requests move

Generic Stateful Application: NotePad

Notepad saves state on shared disk
Failure before save => lost changes
Failover or move (disk & state move)

Demo Step 1: Alice Delivering Service

Demo Step 2: Request Move to Betty

Demo Step 3: Betty Delivering Service

Demo Step 4: Power Fail Betty, Alice Takeover

Demo Step 5: Alice Delivering Service

Demo Step 6: Reboot Betty, now can takeover

[Diagrams, one per step: the two servers of the Windows NT Server Cluster, each with SQL, ODBC, IIS and Local Disks, around the SCSI Disk Cabinet with Shared Disks; labels "SQL Activity" / "No SQL Activity", IP, and HTTP mark which server is delivering service]

Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

Cluster and NT Abstractions

Cluster Abstractions: Cluster, Group, Resource
NT Abstractions: Domain, Node, Service

Basic NT Abstractions

[Diagram: Domain > Node > Service]

Service: program or device managed by a node
- e.g., file service, print service, database server
- can depend on other services (startup ordering)
- can be started, stopped, paused, failed
Node: a single (tightly-coupled) NT system
- hosts services; belongs to a domain
- services on node always remain co-located
- unit of service co-location; involved in naming services
Domain: a collection of nodes
- cooperation for authentication, administration, naming

Cluster Abstractions

[Diagram: Cluster > Resource Group > Resource]

Resource: program or device managed by a cluster
- e.g., file service, print service, database server
- can depend on other resources (startup ordering)
- can be online, offline, paused, failed
Resource Group: a collection of related resources
- hosts resources; belongs to a cluster
- unit of co-location; involved in naming resources
Cluster: a collection of nodes, resources, and groups
- cooperation for authentication, administration, naming

Resources

Resources have...
- Type: what it does (file, DB, print, web)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing Resource Group
- Dependencies on other resources
- Restart parameters (in case of resource failure)

Resource Types

Built-in types
- Generic Application
- Generic Service
- Internet Information Server (IIS) Virtual Root
- Network Name
- TCP/IP Address
- Physical Disk
- FT Disk (Software RAID)
- Print Spooler
- File Share
Added by others
- Microsoft SQL Server
- Message Queues
- Exchange Mail Server
- Oracle
- SAP R/3
- Your application? (use developer kit wizard)

Physical Disk

TCP/IP Address

Network Name

File Share

IIS (WWW/FTP) Server

Print Spooler

Resource States

Resource states:
- Offline: exists, not offering service
- Online: offering service
- Failed: not able to offer service
Resource failure may cause:
- local restart
- other resources to go offline
- resource group to move
- (all subject to group and resource parameters)
Resource failure detected by:
- Polling failure
- Node failure

[State diagram: states Offline, Online Pending, Online, Offline Pending, Failed; messages "Go Online!", "I'm Online!", "Go Off-line!", "I'm Off-line!", "I'm here!"]
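The resource states can be read as a small state machine. The sketch below is one plausible reading of the diagram labels; the "fail" event and the edges into and out of Failed are assumptions, since the slide text does not spell them out.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "Offline"
    ONLINE_PENDING = "Online Pending"
    ONLINE = "Online"
    OFFLINE_PENDING = "Offline Pending"
    FAILED = "Failed"


# (current state, event) -> next state. Event names follow the diagram labels.
TRANSITIONS = {
    (State.OFFLINE, "Go Online!"): State.ONLINE_PENDING,
    (State.ONLINE_PENDING, "I'm Online!"): State.ONLINE,
    (State.ONLINE, "Go Off-line!"): State.OFFLINE_PENDING,
    (State.OFFLINE_PENDING, "I'm Off-line!"): State.OFFLINE,
    # Assumed: a detected failure can interrupt any non-offline state.
    (State.ONLINE_PENDING, "fail"): State.FAILED,
    (State.ONLINE, "fail"): State.FAILED,
    (State.OFFLINE_PENDING, "fail"): State.FAILED,
    # Assumed: a failed resource may be asked to come online again (local restart).
    (State.FAILED, "Go Online!"): State.ONLINE_PENDING,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise ValueError for an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.value}") from None


state = State.OFFLINE
for event in ("Go Online!", "I'm Online!", "fail", "Go Online!"):
    state = step(state, event)
print(state.value)  # Online Pending
```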

Resource Dependencies

Similar to NT Service Dependencies
Orderly startup & shutdown
- A resource is brought online after any resources it depends on are online.
- A resource is taken offline before any resources it depends on
Interdependent resources
- Form dependency trees
- move among nodes together
- failover together
- as per resource group

[Diagram: dependency tree of File Share, IIS Virtual Root, Network Name, IP Address; Resource DLL]
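The ordering rule amounts to a topological sort of the dependency tree: online in dependency order, offline in the reverse order. A minimal sketch using Python's standard `graphlib` (Python 3.9+); the resource names come from the slide's diagram, but the exact edges between them are an assumption for illustration.

```python
from graphlib import TopologicalSorter

# resource -> resources it depends on (edges are assumed, not from the slide).
DEPENDS_ON = {
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Network Name"},
    "IIS Virtual Root": {"Network Name"},
}


def online_order(depends_on):
    """Dependencies first: a resource appears after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Reverse of the online order: dependents go offline before dependencies."""
    return list(reversed(online_order(depends_on)))


up = online_order(DEPENDS_ON)
down = offline_order(DEPENDS_ON)
print(up)
print(down)
```

The relative order of "File Share" and "IIS Virtual Root" is not fixed, because neither depends on the other.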

Dependencies Tab


NT Registry

Stores all configuration information
- Software
- Hardware
Hierarchical (name, value) map
Has an open, documented interface
Is secure
Is visible across the net (RPC interface)
Typical Entry:
\Software\Microsoft\MSSQLServer\MSSQLServer\
    DefaultLogin = GUEST
    DefaultDomain = REDMOND

Cluster Registry

Separate from local NT Registry
Replicated at each node
- Algorithms explained later
Maintains configuration information:
- Cluster members
- Cluster resources
- Resource and group parameters (e.g. restart)
Stable storage
Refreshed from master copy when node joins cluster

Other Resource Properties

Name
Restart policy (restart N times, failover)
Startup parameters
Private configuration info (resource type specific)
- Per-node as well, if necessary
Poll Intervals (LooksAlive, IsAlive, Timeout)
These properties are all kept in Cluster Registry

General Resource Tab

Advanced Resource Tab

Resource Groups

Every resource belongs to a resource group.
Resource groups move (failover) as a unit
Dependencies NEVER cross groups. (Dependency trees contained within groups.)
Group may contain forest of dependency trees

[Diagram: Cluster > Group > Resource; a Payroll Group containing Web Server, SQL Server, IP Address, Drive E:, Drive F:]

Moving a Resource Group


Group Properties

CurrentState: Online, Partially Online, Offline
Members: resources that belong to group
- members determine which nodes can host group.
Preferred Owners: ordered list of host nodes
FailoverThreshold: How many faults cause failover
FailoverPeriod: Time window for failover threshold
FailbackWindowStart: When can failback happen?
FailbackWindowEnd: When can failback happen?
Everything (except CurrentState) is stored in registry

Failover and Failback

Failover parameters
- timeout on LooksAlive, IsAlive
- # local restarts in failure window
- after this, offline.
Failback to preferred node
- (during failback window)
Do resource failures affect group?

[Diagram: Node \\Alice and Node \\Betty, each running the Cluster Service; the IPaddr and name resources move on Failover and return on Failback]

Cluster Concepts

[Diagram: Clusters contain Groups; Groups contain Resources]

Cluster Properties

Defined Members: nodes that can join the cluster
Active Members: nodes currently joined to cluster
Resource Groups: groups in a cluster
Quorum Resource:
- Stores copy of cluster registry.
- Used to form quorum.
Network: Which network used for communication
All properties kept in Cluster Registry

Cluster API Functions
(operations on nodes & groups)

Find and communicate with Cluster
Query/Set Cluster properties
Enumerate Cluster objects
- Nodes
- Groups
- Resources and Resource Types
Cluster Event Notifications
- Node state and property changes
- Group state and property changes
- Resource state and property changes

Cluster Management


Demo

Server startup and shutdown
Installing applications
Changing status
Failing over
Transferring ownership of groups or resources
Deleting Groups and Resources

Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

Architecture

Top tier provides cluster abstractions
Middle tier provides distributed operations
Bottom tier is NT and drivers

[Layer diagram: Failover Manager, Resource Monitor, Cluster Registry, Global Update, Quorum, Membership, Windows NT Server, Cluster Disk Driver, Cluster Net Drivers]

Membership and Regroup

Membership:
- Used for orderly addition and removal from { active nodes }
Regroup:
- Used for failure detection (via heartbeat messages)
- Forceful eviction from { active nodes }

Membership

Defined cluster = all nodes
Active cluster:
- Subset of defined cluster
- Includes Quorum Resource
- Stable (no regroup in progress)

Quorum Resource

Usually (but not necessarily) a SCSI disk
Requirements:
- Arbitrates for a resource by supporting the challenge/defense protocol
- Capable of storing cluster registry and logs
Configuration Change Logs
- Tracks changes to configuration database when any defined member missing (not active)
- Prevents configuration partitions in time

Challenge/Defense Protocol

SCSI-2 has reserve/release verbs
- Semaphore on disk controller
Owner gets lease on semaphore
Renews lease once every 3 seconds
To preempt ownership:
- Challenger clears semaphore (SCSI bus reset)
- Waits 10 seconds
  - 3 seconds for renewal + 2 seconds bus settle time
  - x2 to give owner two chances to renew
- If still clear, then former owner loses lease
- Challenger issues reserve to acquire semaphore
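The challenge/defense steps can be simulated in a single process. This is a sketch with simplified reservation semantics, not real SCSI; `QuorumDisk` and `challenge` are hypothetical names, and only the 3-second renewal, 2-second settle time, and 10-second wait come from the slide.

```python
RENEW_INTERVAL_S = 3      # owner re-issues reserve this often
BUS_SETTLE_S = 2          # time for the bus to settle after a reset
CHALLENGE_WAIT_S = 2 * (RENEW_INTERVAL_S + BUS_SETTLE_S)  # 10 seconds


class QuorumDisk:
    """Stand-in for the reservation 'semaphore' on the disk controller."""

    def __init__(self, owner):
        self.reserved_by = owner

    def bus_reset(self):
        self.reserved_by = None

    def reserve(self, node):
        """Reserve succeeds only if nobody else currently holds the disk."""
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False


def challenge(disk, defender, challenger, defender_alive):
    """Simulate one challenge, second by second; return the final owner."""
    disk.bus_reset()                                  # t = 0: challenger clears
    for t in range(1, CHALLENGE_WAIT_S + 1):
        if defender_alive and t % RENEW_INTERVAL_S == 0:
            disk.reserve(defender)                    # live owner renews lease
    if disk.reserved_by is None:                      # still clear after wait
        disk.reserve(challenger)
    return disk.reserved_by


print(challenge(QuorumDisk("Alice"), "Alice", "Betty", defender_alive=True))   # Alice
print(challenge(QuorumDisk("Alice"), "Alice", "Betty", defender_alive=False))  # Betty
```

The 10-second wait is twice the renewal-plus-settle interval, so a live owner gets two chances to re-reserve before the challenger takes over.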

Challenge/Defense Protocol: Successful Defense

[Timeline, ticks 10 to 16: the Defender Node issues Reserve repeatedly; after the Bus Reset the Challenger Node finds "Reservation detected"]

Challenge/Defense Protocol: Successful Challenge

[Timeline, ticks 10 to 16: the Defender Node issues a single Reserve; after the Bus Reset the Challenger Node finds "No reservation detected" and issues Reserve]

Regroup

Invariant: All members agree on { members }
Regroup re-computes { members }
Each node sends heartbeat message to a peer (default is one per second)
Regroup if two lost heartbeat messages
- suspicion that sender is dead
- failure detection in bounded time
Uses a 5-round protocol to agree.
- Checks communication among nodes.
- Suspected missing node may survive.
Upper levels (global update, etc.) informed of regroup event.
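The heartbeat rule (one message per second, regroup after two are lost) can be sketched as a timeout-based failure detector. The `HeartbeatMonitor` class is hypothetical and covers only the suspicion step, not the 5-round agreement protocol that follows it.

```python
class HeartbeatMonitor:
    """Suspect a peer once `max_missed` heartbeat intervals pass in silence."""

    def __init__(self, interval_s: float = 1.0, max_missed: int = 2):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = {}  # peer name -> time of last heartbeat

    def heartbeat(self, peer: str, now_s: float) -> None:
        """Record that a heartbeat from `peer` arrived at time now_s."""
        self.last_seen[peer] = now_s

    def suspects(self, now_s: float) -> list:
        """Peers silent for at least max_missed heartbeat intervals."""
        limit = self.max_missed * self.interval_s
        return sorted(p for p, t in self.last_seen.items() if now_s - t >= limit)


monitor = HeartbeatMonitor()
monitor.heartbeat("Alice", 0.0)
monitor.heartbeat("Betty", 0.0)
monitor.heartbeat("Alice", 1.0)
monitor.heartbeat("Alice", 2.0)   # Betty has been silent since t = 0
print(monitor.suspects(2.5))      # ['Betty']
```

A suspected node is only a candidate for eviction; as the slide notes, it may still survive the regroup if it turns out to be reachable.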

Membership State Machine

[State diagram: states Initialize, Sleeping, Member Search, Quorum Disk Search, Joining, Forming, Online, Regroup; transition labels Start Cluster, Search Fails, Found Online Member, Acquire (reserve) Quorum Disk, Search or Reserve Fails, Join Succeeds, Synchronize Succeeds, Lost Heartbeat, Non-Minority and Quorum, Minority or no Quorum]

Joining a Cluster

When a node starts up, it mounts and configures only local, non-cluster devices
Starts Cluster Service which
- looks in local (stale) registry for members
- Asks each member in turn to sponsor new node's membership. (Stop when sponsor found.)
Sponsor (any active member)
- Sponsor authenticates applicant
- Broadcasts applicant to cluster members
- Sponsor sends updated registry to applicant
- Applicant becomes a cluster member

Forming a Cluster (when Joining fails)

Use registry to find quorum resource
Attach to (arbitrate for) quorum resource
Update cluster registry from quorum resource
- e.g. if we were down when it was in use
Form new one-node cluster
Bring other cluster resources online
Let others join your cluster

Leaving A Cluster (Gracefully)

Pause:
- Move all groups off this member.
- Change to paused state (remains a cluster member)
Offline:
- Move all groups off this member.
- Sends ClusterExit message to all cluster members
  - Prevents regroup
  - Prevents stalls during departure transitions
- Close Cluster connections (now not an active cluster member)
- Cluster service stops on node
Evict: remove node from defined member list

Leaving a Cluster (Node Failure)

Node (or communication) failure triggers Regroup
If after regroup:
- Minority group OR no quorum device: group does NOT survive
- Non-minority group AND quorum device: group DOES survive
Non-Minority rule:
- Number of new members >= 1/2 old active cluster
- Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster
Quorum guarantees correctness
- Prevents "split-brain"
- e.g. with newly forming cluster containing a single node
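The survival test after a regroup combines the two conditions stated on the slide. A small sketch, with a hypothetical function name:

```python
def group_survives(new_members: int, old_active: int, holds_quorum: bool) -> bool:
    """Non-minority AND quorum device, per the slide's two conditions."""
    non_minority = 2 * new_members >= old_active   # new >= 1/2 old, no floats
    return non_minority and holds_quorum


# A 2-node cluster splits 1/1: both halves are non-minority, so the
# quorum device decides which single half continues.
print(group_survives(1, 2, holds_quorum=True))    # True
print(group_survives(1, 2, holds_quorum=False))   # False
# A lone node from a 4-node cluster is a minority even if it grabs the disk.
print(group_survives(1, 4, holds_quorum=True))    # False
```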

Global Update

Propagates updates to all nodes in cluster
Used to maintain replicated cluster registry
Updates are atomic and totally ordered
Tolerates all benign failures.
Depends on membership
- all are up
- all can communicate
R. Carr, Tandem Systems Review. V1.2 1985, sketches regroup and global update protocol.

Global Update Algorithm

Cluster has locker node that regulates updates.
- Oldest active node in cluster
Send Update to locker node
Update other (active) nodes
- in seniority order (e.g. locker first)
- this includes the updating node
Failure of all updated nodes:
- Update never happened
- Updated nodes will roll back on recovery
Survival of any updated nodes:
- New locker is oldest and so has update if any do.
- New locker restarts update

[Diagram: the update "X=1" is sent to the locker node L and then to the other nodes, each replying "ack"]
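The seniority-order argument can be illustrated with a single-process simulation. This sketch ignores messaging, locking, and acknowledgements; `global_update`, `recover`, and the `crash_after` parameter are hypothetical names used only to show why the oldest survivor has the update if any survivor does.

```python
def global_update(nodes, key, value, crash_after=None):
    """Apply key=value to every active node in seniority order.

    nodes: list of dicts (one registry per node), oldest first; nodes[0] is
    the locker. crash_after: if set, the sender dies after that many nodes
    were updated. Returns the number of nodes updated.
    """
    updated = 0
    for registry in nodes:            # locker first, then by seniority
        if crash_after is not None and updated == crash_after:
            break                     # sender died mid-update
        registry[key] = value
        updated += 1
    return updated


def recover(survivors, key):
    """New locker (oldest survivor) re-drives the update if it saw one."""
    locker = survivors[0]
    if key in locker:                 # oldest has it if any survivor does
        global_update(survivors, key, locker[key])


cluster = [{}, {}, {}]                          # three registries, oldest first
global_update(cluster, "X", 1, crash_after=1)   # only the locker got X=1
recover(cluster, "X")                           # locker survived: update completes
print(cluster)                                  # [{'X': 1}, {'X': 1}, {'X': 1}]

cluster = [{}, {}, {}]
global_update(cluster, "X", 1, crash_after=1)
survivors = cluster[1:]                         # the only updated node also died
recover(survivors, "X")
print(survivors)                                # [{}, {}]
```

In the second run the update "never happened" from the survivors' point of view, which is the first failure case on the slide.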

Cluster Registry

Separate from local NT Registry
Maintains cluster configuration
- members, resources, restart parameters, etc.
Stable storage
Replicated at each member
- Global Update protocol
- NT Registry keeps local copy

Cluster Registry Bootstrapping

Membership uses Cluster Registry for list of nodes
- Circular dependency
Solution:
- Membership uses stale local cluster registry
- Refresh after joining or forming cluster
- Master is either quorum device, or active members

Resource Monitor

Polls resources:
- IsAlive and LooksAlive
Detects failures
- polling failure
- failure event from resource
Higher levels tell it
- Online, Offline
- Restart

Failover Manager

Assigns groups to nodes based on
- Failover parameters
- Possible nodes for each resource in group
- Preferred nodes for resource group
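One way to combine "possible nodes for each resource" with "preferred nodes for the group" is to intersect the former and walk the latter in order. This is a sketch of that idea, not the actual Failover Manager algorithm; `choose_owner` and its fallback rule are assumptions.

```python
def choose_owner(preferred, possible_per_resource, active):
    """Pick a host node for a group, or None if no node qualifies.

    preferred: ordered list of preferred owner nodes for the group.
    possible_per_resource: one set of possible nodes per resource in the group.
    active: set of nodes currently up.
    """
    # A node can host the group only if every resource can run there.
    candidates = set(active)
    for possible in possible_per_resource:
        candidates &= possible
    for node in preferred:                 # honour the preferred ordering
        if node in candidates:
            return node
    # Assumed fallback: any remaining candidate, chosen deterministically.
    return min(candidates) if candidates else None


both = {"Alice", "Betty"}
print(choose_owner(["Alice", "Betty"], [both, both], active={"Alice", "Betty"}))  # Alice
print(choose_owner(["Alice", "Betty"], [both, both], active={"Betty"}))           # Betty
print(choose_owner(["Alice", "Betty"], [both, {"Alice"}], active={"Betty"}))      # None
```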

Failover (Resource Goes Offline)

[Flowchart with Yes/No branches; boxes:]
- Resource Manager detects resource error.
- Attempt to restart resource.
- Has the Resource Retry limit been exceeded?
- Notify Failover Manager.
- Failover Manager checks: Failover Window and Failover Threshold
- Are Failover conditions within Constraints?
- Wait for Failback Window
- Switch resource (and Dependants) Offline.
- Can another owner be found? (Arbitration)
- Leave Group in partially Online state.
- Notify Failover Manager on the new system to bring resource Online.

Pushing a Group (Resource Failure)

[Flowchart]
Resource Monitor notifies Resource Manager of resource failure.
Resource Manager enumerates all objects in the Dependency Tree of the failed resource.
Resource Manager takes each depending resource Offline.
Any resource has "Affect the Group" True?
- No: Leave Group in partially Online state.
- Yes: Resource Manager notifies Failover Manager that the Dependency Tree is Offline and needs to fail over.
Failover Manager performs Arbitration to locate a new owner for the group.
Failover Manager on the new owner node brings the resources Online.

Pulling a Group (Node Failure)

[Flowchart]
Cluster Service notifies Failover Manager of node failure.
Failover Manager determines which groups were owned by the failed node.
Failover Manager performs Arbitration to locate a new owner for the groups.
Resource Manager notifies Failover Manager that the node is Offline and the groups it owned need to fail over.
Failover Manager on the new owner(s) brings the resources Online in dependency order.

Failback to Preferred Owner Node

Group may have a Preferred Owner
Preferred Owner comes back online
Will only occur during the Failback Window (time slot, e.g. at night)

[Flowchart]
Preferred owner comes back Online.
Is the time within the Failback Window?
Resource Manager takes each resource on the current owner Offline.
Resource Manager notifies Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
Failover Manager performs Arbitration to locate the Preferred Owner of the group.
Failover Manager on the Preferred Owner brings the resources Online.
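The failback window check is easy to get wrong when the window spans midnight ("e.g. at night"). A minimal sketch, assuming the window is given as whole hours of the day; the function names and the hour-based representation are assumptions, not the format MSCS stores.

```python
def in_failback_window(hour: int, start: int, end: int) -> bool:
    """True if `hour` (0-23) lies in [start, end), allowing wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def should_fail_back(current_owner, preferred_owner, preferred_up, hour, start, end):
    """Fail back only when the preferred owner is up and the window is open."""
    return (current_owner != preferred_owner
            and preferred_up
            and in_failback_window(hour, start, end))


# Window from 22:00 to 04:00.
print(in_failback_window(23, 22, 4))   # True
print(in_failback_window(12, 22, 4))   # False
print(should_fail_back("Betty", "Alice", True, hour=2, start=22, end=4))   # True
```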

Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

Process Structure

[Diagram: on a Node, the Cluster Service (Failover Manager, Cluster Registry, Global Update, Quorum, Membership) and Resource Monitor processes; each Resource Monitor hosts Resource DLLs, which use private calls to reach Resources (Services, Applications)]

Resource Control

Commands
- CreateResource()
- OnlineResource()
- OfflineResource()
- TerminateResource()
- CloseResource()
- ShutdownProcess()
And resource events

[Diagram: on a Node, Cluster Service, Resource Monitor, DLL, Resource, connected by private calls]

Resource DLLs

Calls to Resource DLL
- Open: get handle
- Online: start offering service
- Offline: stop offering service
  - as a standby or pair-is offline
- LooksAlive: Quick check
- IsAlive: Thorough check
- Terminate: Forceful Offline
- Close: release handle

[State diagram: states Offline, Online Pending, Online, Offline Pending, Failed; messages "Go Online!", "I'm Online!", "Go Off-line!", "I'm Off-line!", "I'm here!"]
[Diagram: Resource Monitor, DLL, Resource, connected by Std calls and Private calls]
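The difference between the quick LooksAlive check and the thorough IsAlive check can be shown with a Python analogue of the entry points. This is not the real C interface: `ResourceDll`, `FakeService`, `poll`, and the every-fifth-tick schedule are all assumptions for illustration.

```python
from abc import ABC, abstractmethod


class ResourceDll(ABC):
    """Python analogue of the Resource DLL entry points named on the slide."""

    @abstractmethod
    def online(self) -> None: ...

    @abstractmethod
    def offline(self) -> None: ...

    @abstractmethod
    def looks_alive(self) -> bool: ...

    @abstractmethod
    def is_alive(self) -> bool: ...


class FakeService(ResourceDll):
    """Hypothetical resource that can hang while its process keeps running."""

    def __init__(self):
        self.running = False
        self.hung = False

    def online(self) -> None:
        self.running = True

    def offline(self) -> None:
        self.running = False

    def looks_alive(self) -> bool:
        return self.running                    # quick: does the process exist?

    def is_alive(self) -> bool:
        return self.running and not self.hung  # thorough: does it respond?


def poll(resource, ticks, is_alive_every=5):
    """Run LooksAlive each tick and IsAlive every Nth tick.

    Returns the tick at which a check first failed, or None.
    """
    for tick in range(1, ticks + 1):
        thorough = tick % is_alive_every == 0
        ok = resource.is_alive() if thorough else resource.looks_alive()
        if not ok:
            return tick
    return None


svc = FakeService()
svc.online()
svc.hung = True
print(poll(svc, ticks=10))   # 5: only the thorough check notices the hang
```

Running the cheap check often and the expensive check rarely keeps polling overhead low while still bounding how long a hung resource goes unnoticed.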

Cluster Communications

Most communication via DCOM / RPC
UDP used for membership heartbeat messages
Standard (e.g. Ethernet) interconnects

[Diagram: Management apps reach each Cluster Service via DCOM; each Cluster Service reaches its Resource Monitors via DCOM / RPC; the two Cluster Services exchange DCOM / RPC (admin) and UDP (heartbeat)]

Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

Application Support

Virtual Servers
Generic Resource DLLs
Resource DLL VC++ Wizard
Cluster API

Virtual Servers

Problem:
- Client and Server Applications do not want node name to change when server app moves to another node.
A Virtual Server simulates an NT Node
- Resource Group (name, disks, databases, ...)
- NetName and IP address (node \\a keeps name and IP address as it moves)
- Virtual Registry (registry "moves" (is replicated))
- Virtual Service Control
- Virtual RPC service
Challenges:
- Limit app to virtual server's devices and services.
- Client reconnect on failover (easy if connectionless, e.g. web clients)

[Diagram: Virtual Server \\a: 1.2.3.4, shown on either node]

Virtual Servers (before failover)

Nodes \\Y and \\Z support virtual servers \\A and \\B
Things that need to fail over transparently
- Client connection
- Server dependencies
- Service names
- Binding to local resources
- Binding to local servers

[Diagram: node \\Y hosts virtual server \\A (SAP, SQL, S:\); node \\Z hosts virtual server \\B (SAP, SQL, T:\); clients use "SAP on A" and "SAP on B"]

Virtual Servers (just after failover)

\\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
\\A resources bind to each other and to local resources (e.g., local file system)
- Registry
- Physical resource
- Security domain
- Time
Transactions used to make DB state consistent.
To "work", local resources on \\Y and \\Z have to be similar
- E.g. time must remain monotonic after failover

[Diagram: node \\Z now hosts both virtual servers \\A (SAP, SQL, S:\) and \\B (SAP, SQL, T:\); clients use "SAP on A" and "SAP on B"]

Address Failover and Client Reconnection

Name and Address rebind to new node
- Details later
Clients reconnect
- Failure not transparent
- Must log on again
- Client context lost (encourages connectionless)
- Applications could maintain context

[Diagram: node \\Z hosts both virtual servers \\A (SAP, SQL, S:\) and \\B (SAP, SQL, T:\); clients "SAP on A" and "SAP on B" reconnect]

Mapping Local References to Group-Relative References

Send client requests to correct server
- \\A\SAP refers to \\.\SQL
- \\B\SAP refers to \\.\SQL
Must remap references:
- \\A\SAP to \\.\SQL$A
- \\B\SAP to \\.\SQL$B
Also handles namespace collision
Done via
- modifying server apps, or
- DLLs to transparently rename

[Diagram: node \\Z hosts both virtual servers \\A (SAP, SQL, S:\) and \\B (SAP, SQL, T:\); clients use "SAP on A" and "SAP on B"]
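The remapping rule shown on the slide (`\\.\SQL` becomes `\\.\SQL$A` for virtual server A) can be sketched as a string rewrite. The `remap` helper is hypothetical; the slide says only that the renaming is done by modifying server apps or by DLLs.

```python
def remap(reference: str, virtual_server: str) -> str:
    r"""Rewrite a node-local reference such as \\.\SQL to \\.\SQL$A for server A."""
    prefix = "\\\\.\\"                    # the literal text \\.\
    if not reference.startswith(prefix):
        return reference                   # not a local reference; leave as is
    return f"{reference}${virtual_server}"


print(remap(r"\\.\SQL", "A"))   # \\.\SQL$A
print(remap(r"\\.\SQL", "B"))   # \\.\SQL$B
```

Because each virtual server gets its own suffix, two virtual servers that land on the same node after failover no longer collide on the name `\\.\SQL`.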

Naming and Binding and Failover

Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
- Applications register names to advertise services
- Example: \\Alice\SQL (i.e. <node><service>)
- Example: 128.2.2.2:80 (=http://www.foo.com/)
Binding
- Clients bind to an address (e.g. name->IP address)
Thus the node name and IP address must failover along with the services (preserve client bindings)

Client to Cluster Communications: IP address mobility based on MAC rebinding

IP rebinds to failover MAC addr
- Transparent to client or server
- Low-level ARP (address resolution protocol) rebinds IP addr to new MAC addr.
Cluster Clients
- Must use IP (TCP, UDP, NBT, ...)
- Must Reconnect or Retry after failure
Cluster Servers
- All cluster nodes must be on same LAN segment

[Diagram: a Client across the WAN, a Router, and the Local Network]
Client:
- Alice <-> 200.110.12.4
- Virtual Alice <-> 200.110.12.5
- Betty <-> 200.110.12.6
- Virtual Betty <-> 200.110.12.7
Router:
- 200.110.120.4 -> AliceMAC
- 200.110.120.5 -> AliceMAC
- 200.110.120.6 -> BettyMAC
- 200.110.120.7 -> BettyMAC
Local Network:
- Alice <-> 200.110.120.4
- Virtual Alice <-> 200.110.120.5
- Betty <-> 200.110.120.6
- Virtual Betty <-> 200.110.120.7
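MAC rebinding can be pictured as overwriting one entry in an ARP-style table. The sketch below is a toy model, not a network implementation; `LanSegment` is a hypothetical class, and the addresses and MAC labels are the ones shown for the router on the slide.

```python
class LanSegment:
    """Toy ARP table: maps IP addresses to MAC addresses on one LAN segment."""

    def __init__(self):
        self.arp = {}

    def bind(self, ip: str, mac: str) -> None:
        """Announce that `ip` is now reachable at `mac` (replaces any old binding)."""
        self.arp[ip] = mac

    def deliver(self, ip: str) -> str:
        """Return the MAC address a frame for `ip` would be sent to."""
        return self.arp[ip]


lan = LanSegment()
lan.bind("200.110.120.5", "AliceMAC")    # Virtual Alice starts on Alice
lan.bind("200.110.120.7", "BettyMAC")    # Virtual Betty starts on Betty
lan.bind("200.110.120.5", "BettyMAC")    # Alice fails: Virtual Alice moves
print(lan.deliver("200.110.120.5"))      # BettyMAC
```

The client keeps using the same IP address throughout; only the table entry on the local segment changes, which is why all cluster nodes must share that segment.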

Time

Time must increase monotonically
- Otherwise applications get confused
- e.g. make/nmake/build
Time is maintained within failover resolution
- Not hard, since failover on order of seconds
Time is a resource, so one node owns time resource
Other nodes periodically correct drift from owner's time

Application Local NT Registry Checkpointing

Resources can request that local NT registry subtrees be replicated
Changes written out to quorum device
- Uses registry change notification interface
Changes read and applied on fail-over

[Diagram: the registry of \\A on \\X is written to the Quorum Device on each update; after failover the registry of \\A on \\B is read back from the Quorum Device]

Registry Replication


Application Support

Virtual Servers
Generic Resource DLLs
Resource DLL VC++ Wizard
Cluster API

Generic Resource DLLs

Generic Application DLL
- Simplest: just starts, stops application, and makes sure process is alive
Generic Service DLL
- Translates DLL calls into equivalent NT Server calls
  - Online => Service Start
  - Offline => Service Stop
  - Looks/IsAlive => Service Status

[Diagram: Resource Monitor, DLL, Resource, connected by Std calls and Private calls]

Generic Application

Generic Service

Application Support

Virtual Servers
Generic Resource DLLs
Resource DLL VC++ Wizard
Cluster API

Resource DLL VC++ Wizard

Asks for resource type name
Asks for optional service to control
Asks for other parameters (and associated types)
Generates DLL source code
Source can be modified as necessary
- E.g. additional checks for Looks/IsAlive

Creating a New Workspace

Specifying Resource Type Name

Specifying Resource Parameters

Automatic Code Generation

Customizing The Code

Application Support

Virtual Servers
Generic Resource DLLs
Resource DLL VC++ Wizard
Cluster API

Cluster API

Allows resources to:
- Examine dependencies
- Manage per-resource data
- Change parameters (e.g. failover)
- Listen for cluster events
- etc.
Specs & API became public Sept 1996
- On all MSDN Level 3
- On web site: http://www.microsoft.com/clustering.htm

Cluster API Documentation


Outline

Why FT and Why Clusters
Cluster Abstractions
Cluster Architecture
Cluster Implementation
Application Support
Q&A

Research Topics?

Even easier to manage
Transparent failover
Instant failover
Geographic distribution (disaster tolerance)
Server pools (load-balanced pool of processes)
Process pair (active/backup process)
10,000 nodes?
Better algorithms
Shared memory or shared disk among nodes
- a truly bad idea?

References

Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site: http://research.microsoft.com/BARC
These slides: http://research.microsoft.com/~joebar/ftcs-27/ftcs20.ppt
Inside Windows NT, H. Custer, Microsoft Pr, ISBN: 155615481
Tandem Global Update Protocol, R. Carr, Tandem Systems Review. V1.2 1985, sketches regroup and global update protocol.
VAXclusters: a Closely Coupled Distributed System, Kronenberg, N., Levey, H., Strecker, W., ACM TOCS, V 4.2 1986. A (the) shared disk cluster.
In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN: 0134376250. Argues for shared nothing
Transaction Processing Concepts and Techniques, Gray, J., Reuter A., Morgan Kaufmann, 1994. ISBN 1558601902, survey of outages, transaction techniques.
