Microsoft Cluster Server (formerly Wolfpack)
Joe Barrera, Jim Gray, Microsoft Research
{joebar, gray}@microsoft.com, http://research.microsoft.com/barc
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
System Availability:
Availability = MTTF / (MTTF + MTTR)
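The availability formula, Availability = MTTF / (MTTF + MTTR), can be computed directly. A minimal sketch (the 10-week MTTF and 90-minute repair time are the figures from the survey on the next slide):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# The survey's ~10-week MTTF and ~90-minute average outage:
a = availability(10 * 7 * 24, 1.5)
print(f"{a:.5f}")   # 0.99911, about "three nines"
```

Note how strongly MTTR matters: halving repair time buys almost as much availability as doubling MTTF.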
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe).
Outage causes: Vendor (hardware and software) 42%; Application software; Communications lines; Operations; Environment.
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average outage duration ~ 90 MINUTES.
[Chart: outage causes over time; legend: unknown, environment, operations, maintenance, hardware, software.]
MTTF improved. Causes shifted from Hardware & Maintenance (from 50% down to 10%) to Software (62%) & Operations (15%). NOTE: systematic under-reporting of Environment, Operations errors, and Application Software.
Summary of FT Studies
Current Situation: ~4-year MTTF =>
Fault Tolerance Works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults.
Many hidden software outages in operations: new software, utilities. Must make all software ONLINE. Software seems to define a 30-year MTTF ceiling.
Reasonable Goal:
Transactions for reliability; Clusters for availability; Security. All built into the OS.
Cluster Goals
Manageability: manage nodes as a single system; perform server maintenance without affecting users; mask faults, so repair is non-disruptive.
Availability: restart failed applications & servers.
Scalability: add nodes for incremental growth in processing and storage.
Fault Model
Hardware fails fast (blue-screen). Software fails fast (or goes to sleep). Software is often repaired by reboot: Heisenbugs.
Cluster: a collection of nodes that act together as a single system. Clients see scalable & FT services (single system image). Node: a server in a cluster; may be an SMP server. Interconnect: communications link used for intra-cluster status info such as heartbeats; can be Ethernet.
[Diagram: client PCs and printers attached to a two-node server cluster.]
Failover Example
[Diagram: a browser's requests go to Server 1 (Web site + Database); on failure, the Web site and Database move to Server 2, which serves the same Web site files and database files.]
Resource States
- Online
- Offline
- Pending
- Partial
- Failed
Demo Configuration
Server Betty and Server Alice, each: SMP Pentium Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server.
[Diagram: each server has local disks; both attach to a shared SCSI disk cabinet; the interconnect is standard Ethernet.]
Administrator client: a Windows NT Workstation running Cluster Admin and SQL Enterprise Mgr.
Demo Administration
Server Alice: runs SQL Trace, runs Globe.
Server Betty: runs SQL Trace.
[Both servers have local disks.]
Windows GUI: shows cluster resource status; replicates status to all servers. © 1996, 1997 Microsoft Corp.
Mplay32 is a generic app: registered with MSCS; MSCS restarts it on failure; move/restart takes ~ 2 seconds; fail-over after 4 failures.
Notepad saves state on the shared disk: failure before save => lost changes; on failover or move, the disk and state move with the group.
[Animation: Web service failover. A browser's HTTP request reaches IIS, which talks through ODBC to SQL Server; the databases live on the shared SCSI disk cabinet, and each node also has local disks. When the serving node fails, the IP address, IIS, and SQL Server resources move to the surviving node, which takes over the shared disks and continues serving the same requests.]
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Cluster Abstractions
Basic NT Abstractions: Domain, Node, Service.
Cluster Abstractions: Cluster, Resource Group, Resource.
Resources
[Diagram: a Cluster contains Groups; a Group contains Resources.]
Resources have...
- Type: what it does (file, DB, print, web)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing Resource Group
- Dependencies on other resources
- Restart parameters (in case of resource failure)
Resource Types
Built-in types: Generic Application; Generic Service; Internet Information Server (IIS) Virtual Root; Network Name; TCP/IP Address; Physical Disk; FT Disk (software RAID); Print Spooler; File Share.
Added by others: Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3. Your application? (Use the developer kit wizard.)
[Screenshots: property sheets for the built-in resource types: Physical Disk, TCP/IP Address, Network Name, File Share, Print Spooler.]
Resource States
- Offline: exists, not offering service
- Online: offering service
- Failed: not able to offer service
[State diagram: Offline, on "Go Online!", enters Online Pending, then ("I'm Online!") Online; Online, on "Go Offline!", enters Offline Pending, then ("I'm Offline!") Offline; a resource that cannot come online enters Failed.]
Resource Dependencies
A resource is brought online only after the resources it depends on are online; it is taken offline before the resources it depends on. Dependencies form trees; interdependent resources (e.g., File Share, IIS Virtual Root, Network Name) move among nodes together and fail over together, as part of their resource group.
[Screenshot: the Dependencies tab.]
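The online/offline ordering rule above is a topological sort of the dependency tree. A minimal sketch (the resource names and dependency shape are illustrative, and cycles are assumed not to occur, since dependencies form trees):

```python
def online_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return an order for bringing resources online: every resource
    appears after everything it depends on (offline order is the reverse)."""
    order, seen = [], set()

    def visit(r):
        if r in seen:
            return
        seen.add(r)
        for dep in depends_on.get(r, []):   # dependencies come online first
            visit(dep)
        order.append(r)

    for r in depends_on:
        visit(r)
    return order

# A plausible tree over the resource types named on the slide:
deps = {
    "IIS Virtual Root": ["Network Name", "File Share"],
    "File Share": ["Physical Disk"],
    "Network Name": [],
    "Physical Disk": [],
}
print(online_order(deps))
# → ['Network Name', 'Physical Disk', 'File Share', 'IIS Virtual Root']
```

Because dependencies never cross groups, this sort can run per group, which is what lets a group fail over as a unit.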
NT Registry
Hierarchical (name, value) map. Has an open, documented interface. Is secure. Is visible across the net (RPC interface). Typical entry:
\Software\Microsoft\MSSQLServer\MSSQLServer\ DefaultLogin = GUEST DefaultDomain = REDMOND
Cluster Registry
Stable storage Refreshed from master copy when node joins cluster
Per-resource properties (all kept in the Cluster Registry):
- Name
- Restart policy (restart N times, then fail over)
- Startup parameters
- Private configuration info (resource-type specific; per-node as well, if necessary)
- Poll intervals (LooksAlive, IsAlive, Timeout)
Resource Groups
[Diagram: a Cluster contains Groups; a Group contains Resources.]
Payroll Group
[Diagram: a Web Server and SQL Server depending on an IP Address, Drive E:, and Drive F:.]
Dependencies NEVER cross groups (dependency trees are contained within a group). A group may contain a forest of dependency trees.
Group Properties
- CurrentState: Online, Partially Online, Offline
- Members: resources that belong to the group
- Preferred Owners: ordered list of host nodes
- FailoverThreshold: how many faults cause failover
- FailoverPeriod: time window for the failover threshold
- FailbackWindowStart / FailbackWindowEnd: when can failback happen?
Everything (except CurrentState) is stored in the registry.
Failover parameters
timeout on LooksAlive, IsAlive # local restarts in failure window after this, offline. (during failback window)
Failover
Cluster Cluster Failback Service Service
IPaddr name
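The restart-then-failover policy above can be sketched as a sliding window of failure timestamps. This is an illustration of the rule, not the MSCS implementation; the class name and the 3-per-15-minutes numbers are made up:

```python
class RestartPolicy:
    """Restart a failed resource locally up to `threshold` times within
    `window` seconds; past that, hand the group to the Failover Manager."""

    def __init__(self, threshold: int, window: float):
        self.threshold = threshold
        self.window = window
        self.failures: list[float] = []

    def on_failure(self, now: float) -> str:
        # Keep only failures that are still inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        return "restart" if len(self.failures) <= self.threshold else "failover"

p = RestartPolicy(threshold=3, window=900)          # 3 restarts per 15 minutes
print([p.on_failure(t) for t in (0, 10, 20, 30)])
# → ['restart', 'restart', 'restart', 'failover']
```

Old failures age out of the window, so an occasional Heisenbug gets a cheap local restart while a crash loop escalates to failover.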
Cluster Properties
Defined Members: nodes that can join the cluster.
Active Members: nodes currently joined to the cluster.
Resource Groups: the groups in the cluster.
Quorum Resource: stores the master copy of the cluster registry; used to form a quorum.
Find and communicate with the cluster; query/set cluster properties; enumerate cluster objects (nodes, groups, resources and resource types); receive node, group, and resource state and property changes.
Cluster Management
Demo
Server startup and shutdown Installing applications Changing status Failing over Transferring ownership of groups or resources Deleting Groups and Resources
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Architecture
Top tier provides cluster abstractions Middle tier provides distributed operations Bottom tier is NT and drivers
[Architecture diagram. Top tier: Failover Manager, Resource Monitor, Cluster Registry. Middle tier: Global Update, Quorum, Membership. Bottom tier: Windows NT Server, Cluster Disk Driver, Cluster Net Drivers.]
Membership: used to track which nodes are active (joined) members of the cluster.
Regroup: used for failure detection (via heartbeat messages) and for forceful eviction from the set of active nodes maintained by Membership.
Quorum: a subset of the defined cluster that includes the Quorum Resource and is stable (no regroup in progress).
Quorum Resource: arbitrated for whenever any defined member is missing (not active). Prevents configuration partitions in time.
Challenge/Defense Protocol
The owner holds a lease on the semaphore (a SCSI reservation) and renews it once every 3 seconds. To preempt ownership:
- The challenger clears the semaphore (SCSI bus reset) and waits 10 seconds: (3 seconds for renewal + 2 seconds bus settle time) x 2, to give the owner two chances to renew.
- If the semaphore is still clear, the former owner loses the lease and the challenger issues a reserve to acquire it.
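The arbitration outcome can be sketched as a toy model. The 3-second renewals and 10-second wait are collapsed into a single flag saying whether the defender was alive to renew during the wait; the class and node names are illustrative, not MSCS code:

```python
class QuorumDisk:
    """Toy model of SCSI challenge/defense for the quorum semaphore."""

    def __init__(self):
        self.owner = None                 # who holds the SCSI reservation

    def challenge(self, challenger: str, defender_alive: bool) -> str:
        self.owner = None                 # bus reset clears the semaphore
        if defender_alive:
            self.owner = "defender"       # a live owner renews within the 10 s wait
        if self.owner is None:            # still clear: the lease is forfeit
            self.owner = challenger       # challenger issues its reserve
        return self.owner

disk = QuorumDisk()
print(disk.challenge("challenger", defender_alive=True))   # → defender
print(disk.challenge("challenger", defender_alive=False))  # → challenger
```

The point of the doubled wait is safety: a live owner gets two renewal chances, so a slow-but-healthy owner is never preempted by a single missed renewal.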
[Timeline: a failed challenge. The owner renews its reservation during the challenger's 10-second wait after the bus reset; the challenger detects the reservation, and the owner keeps the semaphore.]
[Timeline: a successful challenge. The former owner has failed; the semaphore is still clear after the 10-second wait, and the challenger acquires the reservation.]
Regroup
Invariant: all members agree on the member set. Regroup re-computes the member set. Each node sends a heartbeat message to a peer (default: one per second); regroup starts after two lost heartbeat messages.
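The two-missed-heartbeats rule gives failure detection in bounded time. A minimal sketch of the check (the function name and timestamp representation are illustrative):

```python
def needs_regroup(heartbeat_times: list[float], now: float,
                  period: float = 1.0, missed_limit: int = 2) -> bool:
    """True when `missed_limit` consecutive heartbeats (one per `period`
    seconds) have been missed since the peer's last heartbeat."""
    last = max(heartbeat_times) if heartbeat_times else float("-inf")
    return now - last >= missed_limit * period

print(needs_regroup([0.0, 1.0, 2.0], now=3.1))  # → False (one beat missed)
print(needs_regroup([0.0, 1.0, 2.0], now=4.2))  # → True (two beats missed)
```

With a 1-second period and a limit of two, suspicion is raised within about 2 seconds of a crash, after which regroup checks communication among all nodes; a suspected node that can still communicate survives the regroup.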
Two lost heartbeats raise suspicion that the sender is dead, giving failure detection in bounded time. Regroup then checks communication among the nodes; a suspected-missing node may survive.
[Node state diagram: Joining leads to Online when the join and synchronize steps succeed; a lost heartbeat while Online triggers Regroup.]
Joining a Cluster
When a node starts up, it mounts and configures only local, non-cluster devices, then starts the Cluster Service, which:
- looks in the local (stale) registry for cluster members
- asks each member in turn to sponsor the new node's join
Forming a cluster (if no member answers):
- use the registry to find the quorum resource
- attach to (arbitrate for) the quorum resource
- update the cluster registry from the quorum resource (e.g., if we were down when it was last updated)
- form a new one-node cluster, bring other cluster resources online, and let others join
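The join-else-form decision above can be sketched as follows. The member lists, reachability set, and return strings stand in for the RPC calls and SCSI arbitration the slides describe; none of this is MSCS code:

```python
def start_node(node: str, stale_members: list[str],
               reachable: set[str], quorum_free: bool) -> str:
    """Try to join via a sponsor; otherwise arbitrate for the quorum
    resource and form a new one-node cluster."""
    for sponsor in stale_members:            # ask each known member in turn
        if sponsor != node and sponsor in reachable:
            return f"joined via {sponsor}"
    if quorum_free:                          # nobody answered: try to form
        return "formed one-node cluster"
    return "cannot form: quorum resource held elsewhere"

print(start_node("betty", ["alice", "betty"], reachable={"alice"},
                 quorum_free=False))         # → joined via alice
```

The quorum arbitration in the last step is what prevents two isolated nodes from each forming their own cluster over the same shared disks.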
Pause:
Move all groups off this member; change to the paused state (it remains a cluster member).
Offline:
Move all groups off this member; send a ClusterExit message to all cluster members.
Non-Minority rule (which side of a partition survives):
- A minority group, OR a group without the quorum device, does NOT survive.
- A non-minority group (number of new members >= 1/2 of the old active cluster) with the quorum device DOES survive.
This prevents a minority from seizing the quorum device at the expense of a larger, potentially surviving cluster.
Global Update
Propagates updates to all nodes in the cluster; used to maintain the replicated cluster registry. Updates are atomic and totally ordered. Tolerates all benign failures. Depends on Membership.
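The delivery rule that makes global update recoverable is the seniority order: the most senior node (the "locker") is updated first, and the sender includes itself. A sketch of that rule only; the function, node names, and dict-as-registry are illustrative stand-ins for the real protocol's locking and acknowledgment messages:

```python
def global_update(members_by_seniority: list[str], sender: str,
                  registries: dict[str, dict], key: str, value) -> None:
    """Apply one update at every active node in seniority order, the
    locker (most senior) first and including the sender, so a surviving
    senior node always knows whether the update started."""
    assert sender in members_by_seniority
    for node in members_by_seniority:   # seniority order: locker first
        registries[node][key] = value   # an ack would be collected here

regs = {n: {} for n in ("alice", "betty", "carol")}
global_update(["alice", "betty", "carol"], "carol", regs, "X", 1)
print(all(r == {"X": 1} for r in regs.values()))  # → True: replicas agree
```

If the sender dies mid-update, the new locker (the oldest survivor) either never saw the update (so no one did, and partially-updated nodes roll back) or has it and can restart it, which is exactly the recovery argument on the next slide.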
R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and global update protocols.
[Diagram: the updating node acquires the lock from the locker node, then sends the update (e.g., X=1) to every node in seniority order (locker first, including the updating node itself), collecting an ack from each.]
Failure cases: either the update never happened (updated nodes will roll back on recovery), or the new locker, being the oldest node, has the update if any node does, and restarts it.
Cluster Registry
Bootstrapping: the cluster registry is maintained by global update, which depends on membership, yet membership configuration lives in the registry.
Solution: Membership uses the stale local cluster registry, refreshed after joining or forming the cluster. The master copy comes from either the quorum device or the active members.
Resource Monitor
Polls resources: LooksAlive (quick check) and IsAlive (thorough check).
Detects failures via a polling failure or a failure event raised by the resource.
Failover Manager
Inputs: failover parameters, the possible nodes for each resource in the group, and the preferred nodes for the resource group.
[Flowchart: on a resource failure, restart locally while the restart threshold allows; otherwise switch the resource (and its dependents) offline and ask whether another owner can be found (arbitration).]
The Failover Manager performs Arbitration to locate a new owner for the group; if one is found, the Failover Manager on the new owner node brings the resources Online.
Node failover:
- The Resource Manager notifies the Failover Manager that the node is Offline and the groups it owned need to fail over.
- The Failover Manager determines which groups were owned by the failed node.
- The Failover Manager performs Arbitration to locate a new owner for the groups.
- The Failover Manager on the new owner(s) brings the resources Online in dependency order.
A group may have a Preferred Owner. Failback occurs when the Preferred Owner comes back online, and only during the Failback Window (a time slot, e.g., at night):
- The Preferred Owner comes back Online.
- The Resource Manager takes each resource on the current owner Offline.
- The Resource Manager notifies the Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
- The Failover Manager performs Arbitration to locate the Preferred Owner of the group.
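The failback-window test implied here is a small predicate. A sketch, assuming hour-granularity times and a window that may wrap past midnight (the 22:00-04:00 example window is made up):

```python
def can_failback(hour: int, window_start: int, window_end: int,
                 preferred_owner_online: bool) -> bool:
    """Fail back only when the preferred owner is online and the clock
    is inside the failback window."""
    if not preferred_owner_online:
        return False
    if window_start < window_end:                      # same-day window
        return window_start <= hour < window_end
    return hour >= window_start or hour < window_end   # wraps midnight

print(can_failback(23, 22, 4, True))   # → True: inside the overnight window
print(can_failback(12, 22, 4, True))   # → False: wait for the window
```

Deferring failback to a quiet window is the point: failback is a second, deliberate outage, so it is scheduled rather than immediate.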
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Process Structure
[Diagram: on each node, the Cluster Service hosts the Failover Manager, Cluster Registry, Global Update, Quorum, and Membership components. Separate Resource Monitor processes make private calls into Resource DLLs, which control the resources themselves (services and applications).]
Resource Control
[Diagram: the Cluster Service sends commands to a Resource Monitor on the node; the monitor makes private calls into the resource's DLL, which in turn controls the resource.]
Resource DLLs
Entry points:
- Open: get a handle on the resource
- Online: start offering the service
- Offline: stop offering the service (graceful)
- LooksAlive: quick check
- IsAlive: thorough check
- Terminate: forceful Offline
- Close: release the handle
[State diagram, as for resources: Offline, Online Pending, Online, Offline Pending, and Failed; the Resource Monitor drives the DLL through these calls.]
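The entry-point contract (Open, Online, Offline, LooksAlive, IsAlive, Terminate, Close) can be mocked up as a class. A real MSCS resource DLL exports these as C functions; this Python sketch only shows the shape the Resource Monitor expects, and the class and share names are invented:

```python
class FileShareResource:
    """Mock of a resource DLL's entry points for a file-share resource."""

    def open(self, name: str):           # Open: get a handle
        self.name, self.running = name, False
        return self

    def bring_online(self):              # Online: start offering service
        self.running = True

    def take_offline(self):              # Offline: graceful stop
        self.running = False

    def looks_alive(self) -> bool:       # LooksAlive: quick, frequent poll
        return self.running

    def is_alive(self) -> bool:          # IsAlive: thorough, rare poll
        return self.running              # a real DLL would probe the share

    def terminate(self):                 # Terminate: forceful offline
        self.running = False

    def close(self):                     # Close: release the handle
        self.name = None

r = FileShareResource().open("PayrollShare")   # "PayrollShare" is made up
r.bring_online()
print(r.looks_alive(), r.is_alive())           # → True True
```

The split between a cheap LooksAlive and an expensive IsAlive is the design point: the monitor can poll frequently without making health checking itself a load on the resource.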
Cluster Communications
Most communication is via DCOM/RPC; UDP is used for the membership heartbeat messages; standard (e.g., Ethernet) interconnects.
[Diagram: management apps talk DCOM to the Cluster Service on each node; each Cluster Service talks DCOM/RPC to its Resource Monitors; between nodes, DCOM/RPC carries administration and UDP carries heartbeats.]
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Virtual Servers
Problem: client and server applications do not want the node name to change when a server app moves to another node.
Solution, the virtual server, a Resource Group containing: the name, disks, and databases; a NetName and IP address (node \\a keeps its name and IP address as it moves); a virtual registry (the registry is replicated and moves); virtual service control; a virtual RPC service.
Challenges: limit the app to the virtual server's devices and services; clients must reconnect on failover (easy if connectionless, e.g., web clients).
Nodes \\Y and \\Z support virtual servers \\A and \\B. Things that need to fail over transparently: the client connection, server dependencies, service names, bindings to local resources, and bindings to local servers.
[Diagram: SAP on \\A and SAP on \\B, hosted by nodes \\Y and \\Z.]
\\Y's resources and groups (i.e., virtual server \\A) have moved to \\Z. \\A's resources bind to each other and to local resources (e.g., the local file system). Transactions are used to make the DB state consistent; for this to work, the local resources on \\Y and \\Z have to be similar.
[Diagram: \\Z now hosts both \\A (SAP, SQL, S:\) and \\B (SAP, SQL, T:\).]
Clients reconnect to \\A after failover (details later). The failure is not transparent: users must log on again, and client context is lost (which encourages connectionless designs); applications could maintain context themselves.
[Diagram: clients reconnecting to virtual server \\A, now hosted alongside \\B on \\Z.]
Problem: \\A\SAP and \\B\SAP would both refer to the local \\.\SQL. Solution: \\A\SAP refers to \\.\SQL$A and \\B\SAP to \\.\SQL$B, so each virtual server binds to its own service instance.
[Diagram: nodes \\Y and \\Z each hosting SAP and SQL for virtual servers \\A and \\B.]
Services rely on the NT node name and/or IP address to advertise shares, printers, and services. Applications register names to advertise services, e.g., \\Alice\SQL (i.e., <node><service>) or 128.2.2.2:80 (= http://www.foo.com/). Binding: clients bind to an address (e.g., name -> IP address). Thus the node name and IP address must fail over along with the services, to preserve client bindings.
The IP address rebinds to the failover node's MAC address, transparently to client and server: low-level ARP (Address Resolution Protocol) rebinds the IP address to the new MAC address.
Example addresses: Alice <-> 200.110.12.4; Virtual Alice <-> 200.110.12.5; Betty <-> 200.110.12.6; Virtual Betty <-> 200.110.12.7.
Cluster clients must use IP (TCP, UDP, NBT, ...) and must reconnect or retry after a failure. All cluster nodes must be on the same LAN segment.
[Diagram: a WAN router on the local network mapping 200.110.120.4 and 200.110.120.5 -> AliceMAC, 200.110.120.6 and 200.110.120.7 -> BettyMAC.]
Time
Cluster nodes agree on time; otherwise applications get confused (e.g., make/nmake/build). Not hard, since failover is on the order of seconds. Time is a resource, so one node owns the time resource; other nodes periodically correct drift from the owner's time.
Registry Replication
Resources can request that local NT registry subtrees be replicated; changes are written out to the quorum device. Uses the registry change-notification interface.
[Diagram: \\A's registry subtree, available on \\B after failover via the quorum device.]
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Generic resource DLLs map resource calls onto NT service calls:
- Online => Service Start
- Offline => Service Stop
- LooksAlive/IsAlive => Service Status
[Diagram: the Resource Monitor makes standard calls into the generic DLL, which makes private calls to the service.]
[Screenshots: property sheets for the Generic Application and Generic Service resource types.]
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
The wizard asks for the resource type name, an optional service to control, and other parameters (with their associated types); it then generates DLL source code, which can be modified as necessary (e.g., additional checks for LooksAlive/IsAlive).
Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API
Cluster API
Specs & API became public in Sept 1996: on all MSDN Level 3, and on the web site:
http://www.microsoft.com/clustering.htm
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
Research Topics?
- Even easier management
- Transparent failover
- Instant failover
- Geographic distribution (disaster tolerance)
- Server pools (load-balanced pool of processes)
- Process pairs (active/backup process)
- 10,000 nodes?
- Better algorithms
- Shared memory or shared disk among nodes, or is that a truly bad idea?
Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site: http://research.microsoft.com/BARC
These slides: http://research.microsoft.com/~joebar/ftcs-27/ftcs20.ppt

References
- Custer, H., Inside Windows NT, Microsoft Press, ISBN 155615481.
- Carr, R., Tandem Systems Review, V1.2, 1985. Sketches the regroup and global update protocols.
- Kronenberg, N., Levy, H., Strecker, W., "VAXclusters: A Closely-Coupled Distributed System," ACM TOCS, V4.2, 1986. A (the) shared-disk cluster.
- Pfister, G., In Search of Clusters, Prentice Hall, 1995, ISBN 0134376250. Argues for shared-nothing.
- Gray, J., Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.