
VERITAS Cluster Server 3.5 for Solaris

Lesson 10: Faults and Failovers

Overview
[Figure: course map. Topics shown, in no particular order:]
Troubleshooting
Using Volume Manager
Event Notification
Service Group Basics
Introduction
Cluster Communication
Faults and Failovers
Preparing Resources
Terms and Concepts
Installing Applications
Resources and Agents
Installing VCS
Managing Cluster Services
NFS Resources
Using Cluster Manager
VCS_3.5_Solaris_R3.5_20020915

Objectives
After completing this lesson, you will be able to:
Describe how VCS responds to faults.
Implement failover policies.
Set limits and prerequisites.
Use system zones to control failover.
Control failover behavior using attributes.
Clear faults.
Probe resources.
Flush service groups.
Test failover.

How VCS Responds to Faults
When a resource faults:
1. Call resfault (if present).
2. Offline all resources in path.
3. Critical online resource in path?
   N: Keep group partially online.
   Y: Offline entire service group.
4. System available in SystemList?
   N: Keep service group offline. Run NoFailover trigger.
   Y: Start service group elsewhere.
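
The decision flow above can be sketched as a small function. This is only an illustrative model of the flowchart, not VCS code; the function name, its parameters, and the action strings are hypothetical.

```python
def respond_to_fault(critical_in_path, target_available, resfault_present=True):
    """Return the ordered actions from the fault-response flowchart.

    All names here are illustrative; this only models the slide's decisions.
    """
    actions = []
    if resfault_present:
        actions.append("call resfault trigger")
    actions.append("offline all resources in path")
    if not critical_in_path:
        # No critical resource is affected, so the rest of the group stays up.
        actions.append("keep group partially online")
        return actions
    actions.append("offline entire service group")
    if target_available:
        actions.append("start service group elsewhere")
    else:
        actions.append("keep service group offline")
        actions.append("run NoFailover trigger")
    return actions


if __name__ == "__main__":
    for step in respond_to_fault(critical_in_path=True, target_available=False):
        print(step)
```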



Practice Exercise
Resource 4 Faults
[Figure: resource dependency diagram with numbered resources, and a worksheet table with the columns Case, NonCritical, Offline, Taken offline due to fault, and Starts on another system. Cell values shown: 6,7; 4,6; 4,6,7.]

Practice Answers
Resource 4 Fails
[Figure: resource dependency diagram with numbered resources, and the completed table with the columns Case, NonCritical, Offline, Taken offline due to fault, and Starts on another system. Cell values shown, in order: 6,7; All; 6,7; All; 6,7; 4,6; 6,7; All; 4,6,7; 6,7; All but 7.]

Failover Policies
The AutoFailOver attribute indicates whether automatic failover is enabled for the service group.
Default value is 1 (enabled).
The FailOverPolicy attribute specifies how a target system is selected:
Priority: The system with the lowest priority number in the list is selected (default).
RoundRobin: The system with the fewest active service groups is selected.
Load: The system with the greatest available capacity is selected.

Example configuration:
hagrp -modify group AutoFailOver 0
hagrp -modify group FailOverPolicy Load
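
As a rough illustration of the three FailOverPolicy values, the sketch below picks a target from a set of candidate systems. It is a simplified model, not the VCS implementation; the function name, the dictionaries, and the sample numbers are assumptions made for the example.

```python
def select_target(policy, priorities, active_groups=None, available=None):
    """Pick a failover target among candidate systems (illustrative only).

    priorities:    system name -> SystemList priority number
    active_groups: system name -> number of active service groups
    available:     system name -> AvailableCapacity
    """
    candidates = list(priorities)
    if policy == "Priority":
        return min(candidates, key=lambda s: priorities[s])
    if policy == "RoundRobin":
        return min(candidates, key=lambda s: active_groups[s])
    if policy == "Load":
        return max(candidates, key=lambda s: available[s])
    raise ValueError(f"unknown FailOverPolicy: {policy}")


# Hypothetical candidates remaining after the current system is excluded.
priorities = {"Svr1": 1, "Svr2": 2}
active_groups = {"Svr1": 3, "Svr2": 1}
available = {"Svr1": 50, "Svr2": 175}

if __name__ == "__main__":
    for policy in ("Priority", "RoundRobin", "Load"):
        print(policy, select_target(policy, priorities, active_groups, available))
```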

Priority Failover Policy
The lowest numbered system in SystemList is selected.
[Figure: systems Svr1, Svr2, and Svr3 hosting service groups AP1, DB, and AP2, with these SystemList values:]
SystemList = {Svr1=0, Svr2=1}
SystemList = {Svr3=0, Svr1=1, Svr2=2}
SystemList = {Svr2=0, Svr1=1}

Round Robin Failover Policy
The system with the fewest running service groups is selected.
[Figure: systems Svr1, Svr2, Svr3, and Svr4.]

Load Failover Policy
1. Define system Capacity based on server capability.
[Figure: four servers. Three are assigned a Capacity of 300; one is assigned a Capacity of 150.]
Capacity - Load = Available

Determining Load
1. Define system Capacity based on server capability.
2. Define group Load based on application requirements.
iPlanet requires 100 units of Load.
Sybase requires 125.
Oracle 8i requires 150.
NFS shares require 75 each.

Service groups (Load)             Capacity   Load     Available
iPlanet (100)                     300        100      200
Sybase (125)                      300        125      175
Oracle 8i (150), NFS1 (75)        300        150+75   75
NFS2 (75), NFS3 (75)              150        75+75    0
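
The Capacity and Load figures above can be checked with a few lines of arithmetic. The server names below are hypothetical labels, since the slide does not name the systems; the numbers are the ones shown.

```python
# Capacity per system (server names are hypothetical labels).
capacity = {"server1": 300, "server2": 300, "server3": 300, "server4": 150}

# Load of each service group online on each system.
loads = {
    "server1": {"iPlanet": 100},
    "server2": {"Sybase": 125},
    "server3": {"Oracle8i": 150, "NFS1": 75},
    "server4": {"NFS2": 75, "NFS3": 75},
}

# Available = Capacity - sum of the Loads of the online groups.
available = {name: capacity[name] - sum(loads[name].values()) for name in capacity}

if __name__ == "__main__":
    for name, value in available.items():
        print(name, value)
```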

Determining the Failover Target
1. Oracle 8i FAILS.
2. VCS brings Oracle 8i online on the server with 200 AvailableCapacity.
3. VCS recalculates AvailableCapacity based on new Load.

State when the fault occurs:
Service groups (Load)             Capacity   Load     Available
iPlanet (100)                     300        100      200
Sybase (125)                      300        125      175
Oracle 8i (150), NFS1 (75)        300        150+75   75
NFS2 (75), NFS3 (75)              150        75+75    0

After recalculation, the iPlanet server carries a Load of 250 (Available 50), and the server that still runs NFS1 carries a Load of 75 (Available 225).

Tracing a Server Failure
1. The NFS server FAILS.
2. VCS brings NFS2 online on the server with 225 AvailableCapacity.
3. VCS recalculates AvailableCapacity based on new Load.

State when the server fails:
Service groups (Load)             Capacity   Load     Available
iPlanet (100), Oracle 8i (150)    300        250      50
Sybase (125)                      300        125      175
NFS1 (75)                         300        75       225
NFS2 (75), NFS3 (75) [failed]     150        75+75    0

Completing Failover
4. VCS brings NFS3 online on the server with 175 AvailableCapacity.
5. VCS recalculates AvailableCapacity based on new Load.

State after NFS2 has moved:
Service groups (Load)             Capacity   Load     Available
iPlanet (100), Oracle 8i (150)    300        250      50
Sybase (125)                      300        125      175
NFS1 (75), NFS2 (75)              300        150      150
NFS2 (75), NFS3 (75) [failed]     150        75+75    0

After recalculation, the Sybase server carries a Load of 200 (Available 100).
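
The server failure traced on the preceding slides can be replayed with a short simulation. This is a minimal sketch of the Load policy as the slides describe it (each displaced group goes to the system with the highest AvailableCapacity, which is then recalculated); the server names and the function are hypothetical.

```python
capacity = {"srv1": 300, "srv2": 300, "srv3": 300, "srv4": 150}
groups = {
    "srv1": {"iPlanet": 100, "Oracle8i": 150},
    "srv2": {"Sybase": 125},
    "srv3": {"NFS1": 75},
    "srv4": {"NFS2": 75, "NFS3": 75},
}


def available(system):
    """AvailableCapacity = Capacity minus the Load of the online groups."""
    return capacity[system] - sum(groups[system].values())


def fail_system(failed):
    """Move each group from the failed system to the best remaining target."""
    moves = []
    displaced = groups.pop(failed)
    for name, load in displaced.items():
        target = max(groups, key=available)  # highest AvailableCapacity wins
        groups[target][name] = load          # recalculated on the next pass
        moves.append((name, target))
    return moves


moves = fail_system("srv4")

if __name__ == "__main__":
    print(moves)
    print({s: available(s) for s in groups})
```

With the numbers from the slides, NFS2 lands on the server that had 225 available and NFS3 on the server that had 175 available.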

Setting Load and Capacity
The Load and Capacity attributes are user-defined values.
Set attributes using the hagrp and hasys commands.
Examples:
hasys -modify LgSrv1 Capacity 300
hagrp -modify OracleSG Load 150

AvailableCapacity is calculated by VCS:
Capacity minus Load equals AvailableCapacity

Dynamic Load Balancing
1. External software monitors CPU utilization (30, 40, 75, and 80 percent utilization for the systems shown below).
2. The software sets the DynamicLoad attribute according to the system Capacity value using hasys -load system value.
3. For example, if CPU utilization is 30% and Capacity is set to 300, set DynamicLoad to 90 (30% of 300).

CPU    Service groups (Load)          Capacity   DynLoad   Available
30%    iPlanet (100)                  300        90        210
40%    Sybase (125)                   300        120       180
75%    Oracle 8i (150), NFS1 (75)     300        225       75
80%    Process (40), Process (40)     100        80        20
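
The DynamicLoad arithmetic above is simple enough to sketch. The function names and the system names are hypothetical; the command string follows the hasys -load system value form given on the slide, and the script only prints it rather than running it.

```python
def dynamic_load(capacity, cpu_percent):
    """Scale CPU utilization (percent) to the system's Capacity units."""
    return capacity * cpu_percent // 100


def hasys_load_command(system, capacity, cpu_percent):
    """Build the command line an external monitor might issue (not executed)."""
    return f"hasys -load {system} {dynamic_load(capacity, cpu_percent)}"


# (system name, Capacity, CPU utilization); names are hypothetical.
systems = [("sysA", 300, 30), ("sysB", 300, 40), ("sysC", 300, 75), ("sysD", 100, 80)]

if __name__ == "__main__":
    for name, cap, cpu in systems:
        dyn = dynamic_load(cap, cpu)
        print(hasys_load_command(name, cap, cpu), "-> available", cap - dyn)
```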

The LoadWarning Trigger
1. Runs when a system has been running at a specified percent of its Capacity level for a specified period of time.
2. Configured by placing the loadwarning script in /opt/VRTSvcs/bin/triggers and setting system attributes.
3. This example configuration causes VCS to run the trigger if system Svr4 runs at 90 percent of capacity for ten minutes.

main.cf:
system Svr4 (
    Capacity = 100
    LoadWarningLevel = 90
    LoadTimeThreshold = 600
    )

[Figure: the four systems from the Dynamic Load Balancing slide. Svr4 (Capacity 100, DynLoad 80, Available 20) is running at 80% utilization.]
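
To make the interaction of LoadWarningLevel and LoadTimeThreshold concrete, here is a minimal sketch of the condition described above. It assumes the trigger fires once the load has stayed at or above the warning level continuously for the threshold period; how VCS samples and evaluates this internally is not stated in the source, so the function and the sample data are illustrative only.

```python
def load_warning_fires(samples, capacity=100, level=90, threshold=600):
    """Return True if load stays >= level% of capacity for `threshold` seconds.

    samples: list of (seconds, load) pairs in time order (hypothetical input).
    """
    start = None
    for seconds, load in samples:
        if load * 100 >= level * capacity:
            if start is None:
                start = seconds          # first sample at or above the level
            if seconds - start >= threshold:
                return True
        else:
            start = None                 # dropped below the level: start over
    return False


sustained = [(t, 95) for t in range(0, 601, 60)]
interrupted = [(0, 95), (300, 80), (600, 95), (900, 95)]

if __name__ == "__main__":
    print(load_warning_fires(sustained))
    print(load_warning_fires(interrupted))
```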

System Limits
1. Define system Limits based on the server properties:
Limits = {Processors=4, Mem=512}
[Figure: four servers. Three have Processors=4 and Mem=512 (Limits 4,512); one has Processors=1 and Mem=128 (Limits 1,128).]
Limits - Prereq = Current

Service Group Prerequisites
1. Define system Limits based on the server properties.
2. Define service group Prerequisites based on application requirements.
iPlanet requires 1 Processor and 184 MB RAM.
Sybase requires 1 Processor and 212 MB RAM.
Oracle requires 2 Processors and 256 MB RAM.
NFS requires 1 Processor and 48 MB RAM.

Service groups (Prereq)                Limits   Prereq   Current
iPlanet (1,184)                        4,512    1,184    3,328
Sybase (1,212)                         4,512    1,212    3,300
Oracle 8i (2,256), NFS1 (1,48)         4,512    3,304    1,208
NFS2 (1,48), NFS3 (1,48)               1,128    1,96     1,32
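
The Limits minus Prereq arithmetic can be sketched as follows. This is an illustrative model only; the helper names are hypothetical, and the values are the Limits and Prerequisites listed above for the 4-processor, 512 MB servers.

```python
def remaining(limits, online_prereqs):
    """Subtract the Prerequisites of every online group from the Limits."""
    current = dict(limits)
    for prereq in online_prereqs:
        for key, amount in prereq.items():
            current[key] -= amount
    return current


def can_host(limits, online_prereqs, new_prereq):
    """A group fits only if every Prerequisite is covered by what remains."""
    current = remaining(limits, online_prereqs)
    return all(current[key] >= amount for key, amount in new_prereq.items())


LIMITS = {"Processors": 4, "Mem": 512}
IPLANET = {"Processors": 1, "Mem": 184}
SYBASE = {"Processors": 1, "Mem": 212}
ORACLE = {"Processors": 2, "Mem": 256}
NFS = {"Processors": 1, "Mem": 48}

if __name__ == "__main__":
    print(remaining(LIMITS, [ORACLE, NFS]))
    print(can_host(LIMITS, [ORACLE, NFS], SYBASE))
    print(can_host(LIMITS, [IPLANET], ORACLE))
```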

Combining Capacity and Limits
When used together, VCS determines the failover target as follows:
Limits and Prerequisites are used to determine a subset of potential failover targets.
Of this subset, the system with the highest value for AvailableCapacity is selected.
If multiple systems have the same AvailableCapacity, the first system in SystemList is selected.
Limits are hard values: if a system does not meet the Prerequisites, the service group cannot be started on that system.
Capacity is a soft limit: the system with the highest AvailableCapacity is selected, even if AvailableCapacity results in a negative number.
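
The two-stage selection described above (hard filter on Limits and Prerequisites, then highest AvailableCapacity, with ties going to the first system in SystemList) might look like this in outline. The function, the system names, and the sample values are assumptions for illustration, not VCS internals.

```python
def choose_target(system_list, current_limits, available, prereq):
    """Return the failover target, or None if no system meets the Prerequisites.

    system_list:    system names in SystemList order
    current_limits: system -> remaining Limits (Limits minus online Prereqs)
    available:      system -> AvailableCapacity (may be negative)
    prereq:         Prerequisites of the group being failed over
    """
    eligible = [
        s for s in system_list
        if all(current_limits[s].get(key, 0) >= need for key, need in prereq.items())
    ]
    if not eligible:
        return None
    best = max(available[s] for s in eligible)
    # The first eligible system in SystemList order wins a tie.
    return next(s for s in eligible if available[s] == best)


system_list = ["srv1", "srv2", "srv4"]
current_limits = {
    "srv1": {"Processors": 3, "Mem": 328},
    "srv2": {"Processors": 3, "Mem": 300},
    "srv4": {"Processors": 1, "Mem": 32},
}
oracle_prereq = {"Processors": 2, "Mem": 256}

if __name__ == "__main__":
    print(choose_target(system_list, current_limits,
                        {"srv1": 200, "srv2": 175, "srv4": 0}, oracle_prereq))
```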

Failover Zones
[Figure: six systems, sysa through sysf, divided into two zones: the preferred failover zone for the Database service group and the preferred failover zone for the Web service group.]
The SystemList for both service groups includes all systems in the cluster.

SystemZones Attribute
Used to define the preferred failover zones for each service group.
If the service group is online in a system zone, it fails over to other systems in the same zone based on the FailOverPolicy, until there are no systems available in that zone.
When there are no other systems for failover in the same zone, VCS chooses a system in a new zone from the SystemList based on the FailOverPolicy.
To define SystemZones:
Syntax:
hagrp -modify group_name SystemZones \
sys1 zone# sys2 zone# sys zone#
Example:
hagrp -modify OracleSG SystemZones sysa 0 \
sysb 0 sysc 1 sysd 1 syse 1 sysf 1
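
The zone preference can be pictured as a filter applied before the FailOverPolicy chooses among the remaining candidates. The sketch below uses the zone assignments from the OracleSG example; the function name and the sets of available systems are hypothetical.

```python
# Zone assignments from the OracleSG example.
SYSTEM_ZONES = {"sysa": 0, "sysb": 0, "sysc": 1, "sysd": 1, "syse": 1, "sysf": 1}


def zone_candidates(current, available_systems, zones=SYSTEM_ZONES):
    """Return failover candidates, preferring systems in the current zone.

    The FailOverPolicy would then pick one system from the returned list.
    """
    others = [s for s in available_systems if s != current]
    same_zone = [s for s in others if zones[s] == zones[current]]
    return same_zone if same_zone else others


if __name__ == "__main__":
    print(zone_candidates("sysa", ["sysa", "sysb", "sysc", "sysd"]))
    print(zone_candidates("sysb", ["sysc", "sysd", "syse"]))
```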

Controlling Failover Behavior with Resource Type Attributes
RestartLimit
Affects how the agent responds to a resource fault
Default: 0

ConfInterval
Determines the period of time within which a tolerance or restart counter can be incremented
Default: 600 seconds

ToleranceLimit
Enables the monitor entry point to return OFFLINE several times before the resource is declared FAULTED
Default: 0

Restart Example
RestartLimit=1
The resource is restarted one time within the ConfInterval time frame.
ConfInterval=180
The resource can be restarted once within a three-minute interval.
MonitorInterval=60 seconds (default value)
The resource is monitored every 60 seconds.
[Figure: timeline marked in MonitorInterval steps within one ConfInterval, showing the resource states Online, Online, Offline, Restart, Online, Offline, Faulted.]
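
The restart example can be replayed with a simplified model of the agent's bookkeeping. This sketch assumes the restart counter is reset once the resource has stayed online for a full ConfInterval and that one monitor result arrives per MonitorInterval; ToleranceLimit is left at its default of 0 and is not modeled. The function and event names are hypothetical.

```python
def simulate(results, restart_limit=1, conf_interval=180, monitor_interval=60):
    """Replay monitor results ("online"/"offline") and return the agent events."""
    events = []
    restarts = 0
    online_since = 0
    for cycle, result in enumerate(results):
        now = cycle * monitor_interval
        if result == "online":
            if now - online_since >= conf_interval:
                restarts = 0              # stable for a full ConfInterval
            events.append("Online")
        elif restarts < restart_limit:
            restarts += 1                 # restart instead of faulting
            online_since = now
            events.append("Restart")
        else:
            events.append("Faulted")      # restart budget used up
            break
    return events


if __name__ == "__main__":
    print(simulate(["online", "online", "offline", "online", "offline"]))
```

With RestartLimit=1 and ConfInterval=180, a second offline result within three minutes of the restart leads to a fault, which matches the timeline in the figure.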

Adjusting Monitoring
MonitorInterval:
Default value is 60 seconds for most resource
types.
Consider reducing to 10 or 20 seconds for testing.
Use caution when changing this value:
Load is increased on cluster systems.
Resources can fault if they cannot respond in the
interval specified.

OfflineMonitorInterval:
Default is 300 seconds for most resource types.
Consider reducing to 60 seconds for testing.

Modifying Resource Type Attributes
Can be used to optimize agents
Applied to all resources of the specified type
Command line example:
hatype -modify FileOnOff MonitorInterval 5

Preventing Failover
A frozen service group does not fail over when a critical resource faults.
The service group must be unfrozen to enable failover.
To freeze a service group:
hagrp -freeze service_group [-persistent]

To unfreeze a service group:
hagrp -unfreeze service_group [-persistent]

A persistent freeze:
Requires the cluster configuration to be open
Remains in effect even if VCS is stopped and restarted throughout the cluster

Clearing Faults
Verify that the faulted resource is offline.
Fix the problem that caused the fault and clean up any residual effects.
To clear a fault, type:
hares -clear resource_name [-sys system_name]

To clear all faults in a service group, type:
hagrp -clear group_name [-sys system_name]

Persistent resources are cleared by probing:
hares -probe resource_name -sys system_name

Probing Resources
Causes VCS to immediately monitor the resource
To probe a resource, type:
hares -probe resource_name -sys system_name

You can clear a persistent resource by probing it after the underlying problem has been fixed.

Flushing Service Groups
All online/offline agent processes are stopped.
All resources in transitional states waiting to go online are taken offline.
Propagation of the offline operation is stopped, but resources waiting to go offline remain in the transitional state.
You must verify that the physical or software resources are stopped at the operating system level after flushing to avoid creating a concurrency violation.
To flush a service group, type:
hagrp -flush group_name -sys system_name

Testing Failover
Use test resources, such as FileOnOff, when
applicable.
Set lower values for MonitorInterval,
OfflineMonitorInterval, and ConfInterval to detect
faults more quickly.
Manually online, offline, and switch the service group
among all systems.
Simulate failure of each resource in the service
group.
Simulate failover of the entire system.

Testing Examples
Force a resource to fault.
Reboot a system.
Halt and reboot a system.
Remove power from a system.


Summary
You should now be able to:
Describe how VCS responds to faults.
Implement failover policies.
Set limits and prerequisites.
Use system zones to control failover.
Control failover behavior using attributes.
Clear faults.
Probe resources.
Flush service groups.
Test failover.

Lab 10: Faults and Failovers
[Figure: lab setup for Student Red (RedNFSSG) and Student Blue (BlueNFSSG). Triggers: resfault, nofailover, sysoffline.]
