Vcs35 Sol Les10

VERITAS Cluster Server
3.5 for Solaris

Lesson 10
Faults and Failovers
Overview
Course map topics: Introduction; Terms and Concepts; Installing VCS; Managing Cluster Services; Using Cluster Manager; Service Group Basics; Preparing Resources; Resources and Agents; NFS Resources; Installing Applications; Cluster Communication; Faults and Failovers; Event Notification; Using Volume Manager; Troubleshooting
VCS_3.5_Solaris_R3.5_20020915
Objectives
After completing this lesson, you will be able to:
Describe how VCS responds to faults.
Implement failover policies.
Set limits and prerequisites.
Use system zones to control failover.
Control failover behavior using attributes.
Clear faults.
Probe resources.
Flush service groups.
Test failover.
How VCS Responds to Faults

1. Call the resfault trigger (if present).
2. Offline all resources in the path.
3. Is a Critical online resource in the path?
   N: Keep the group partially online.
   Y: Offline the entire service group and continue with step 4.
4. Is a system available in SystemList?
   N: Keep the service group offline and run the NoFailover trigger.
   Y: Start the service group elsewhere.
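The fault-response flow on this slide can be sketched in a few lines of Python. This is only an illustrative model of the flowchart, not VCS code; the function name and its boolean parameters are assumptions made for the example.

```python
def respond_to_fault(critical_in_path, target_available, resfault_present=True):
    """Return the ordered actions from the fault-response flowchart."""
    actions = []
    if resfault_present:
        actions.append("call resfault trigger")
    actions.append("offline all resources in path")

    # No Critical online resource in the path: the group stays partially online.
    if not critical_in_path:
        actions.append("keep group partially online")
        return actions

    actions.append("offline entire service group")
    if target_available:
        actions.append("start service group elsewhere")
    else:
        actions.append("keep service group offline")
        actions.append("run NoFailover trigger")
    return actions


for step in respond_to_fault(critical_in_path=True, target_available=False):
    print(step)
```

Calling the function with different combinations of the two flags walks each branch of the flowchart.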
Practice Exercise
Resource 4 Faults
[Exercise table with columns: Case, NonCritical, Offline, Taken offline due to fault, Starts on another system]
Practice Answers
Resource 4 Fails
[Answer table with columns: Case, NonCritical, Offline, Taken offline due to fault, Starts on another system]
Failover Policies
The AutoFailOver attribute indicates whether automatic
failover is enabled for the service group.
Default value is 1, enabled.
The FailOverPolicy attribute specifies how a target
system is selected:
Priority: The system with the lowest priority number in the list is
selected (default).
RoundRobin: The system with the least number of active service
groups is selected.
Load: The system with the greatest available capacity is selected.
Example configuration:
hagrp -modify group AutoFailOver 0
hagrp -modify group FailOverPolicy Load
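As a rough illustration of how the three FailOverPolicy values differ, the sketch below picks a target from a list of candidate systems. It is a simplified model, not the VCS implementation; the dictionary keys and the sample numbers are hypothetical.

```python
def select_target(policy, candidates):
    """Pick a failover target name according to a FailOverPolicy value."""
    if policy == "Priority":
        # Lowest priority number in SystemList wins.
        return min(candidates, key=lambda s: s["priority"])["name"]
    if policy == "RoundRobin":
        # Fewest active service groups wins.
        return min(candidates, key=lambda s: s["active_groups"])["name"]
    if policy == "Load":
        # Greatest available capacity wins.
        return max(candidates, key=lambda s: s["available_capacity"])["name"]
    raise ValueError(f"unknown FailOverPolicy: {policy}")


# Hypothetical cluster state, for illustration only.
systems = [
    {"name": "Svr1", "priority": 0, "active_groups": 3, "available_capacity": 50},
    {"name": "Svr2", "priority": 1, "active_groups": 1, "available_capacity": 120},
    {"name": "Svr3", "priority": 2, "active_groups": 2, "available_capacity": 200},
]

for policy in ("Priority", "RoundRobin", "Load"):
    print(policy, "->", select_target(policy, systems))
```

With this sample data each policy selects a different system, which makes the effect of the attribute easy to see.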
Priority Failover Policy

The lowest numbered system in SystemList is selected.
Diagram: service groups AP1, AP2, and DB run on systems Svr1, Svr2, and Svr3, with these SystemList values:
SystemList = {Svr1 = 0, Svr2 = 1}
SystemList = {Svr3 = 0, Svr1 = 1, Svr2 = 2}
SystemList = {Svr2 = 0, Svr1 = 1}
Round Robin Failover Policy

The system with the fewest running service groups is selected.
Diagram: systems Svr1, Svr2, Svr3, and Svr4
Load Failover Policy

1. Define system Capacity based on server capability.
   Three of the servers are each assigned a Capacity of 300.
   The fourth server is assigned a Capacity of 150.
For each system, Capacity minus Load equals Available.
Determining Load
1. Define system Capacity based on server capability.
2. Define group Load based on application requirements:
   iPlanet requires 100 units of Load.
   Sybase requires 125.
   Oracle 8i requires 150.
   NFS shares require 75 each.

Service groups on server   Capacity   Load       Available
iPlanet                    300        100        200
Sybase                     300        125        175
Oracle 8i, NFS1            300        150 + 75   75
NFS2, NFS3                 150        75 + 75    0
Determining the Failover Target

1. Oracle 8i fails.
2. VCS brings Oracle 8i online on the server with an AvailableCapacity of 200.
3. VCS recalculates AvailableCapacity based on the new Load.

Values at the time of the fault (Oracle 8i moves to the server running iPlanet):
Service groups on server   Capacity   Load       Available
iPlanet                    300        100        200
Sybase                     300        125        175
Oracle 8i, NFS1            300        150 + 75   75
NFS2, NFS3                 150        75 + 75    0
Tracing a Server Failure

1. The NFS server fails.
2. VCS brings NFS2 online on the server with an AvailableCapacity of 225.
3. VCS recalculates AvailableCapacity based on the new Load.

Values at the time of the failure (NFS2 moves to the server running NFS1):
Service groups on server      Capacity   Load      Available
iPlanet, Oracle 8i            300        250       50
Sybase                        300        125       175
NFS1                          300        75        225
NFS2, NFS3 (failed server)    150        75 + 75   0
Completing Fail Over

4. VCS brings NFS3 online on the server with an AvailableCapacity of 175.
5. VCS recalculates AvailableCapacity based on the new Load.

Values before the final recalculation (NFS3 moves to the server running Sybase):
Service groups on server      Capacity   Load      Available
iPlanet, Oracle 8i            300        250       50
Sybase                        300        125       175
NFS1, NFS2                    300        150       150
NFS2, NFS3 (failed server)    150        75 + 75   0
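The Load policy walkthrough (Oracle 8i faulting, then the NFS server failing) can be replayed with a short script. This is a sketch that uses the Capacity and Load values from the slides; the system names sys1 through sys4 and the helper functions are hypothetical, and the model ignores everything except AvailableCapacity.

```python
def available(system):
    """AvailableCapacity = Capacity minus the sum of the online group Loads."""
    return system["capacity"] - sum(system["groups"].values())


def fail_over(systems, group, source):
    """Move a group from source to the running system with the most AvailableCapacity."""
    load = systems[source]["groups"].pop(group)
    candidates = [n for n in systems if n != source and systems[n]["up"]]
    target = max(candidates, key=lambda n: available(systems[n]))
    systems[target]["groups"][group] = load
    return target


systems = {
    "sys1": {"capacity": 300, "up": True, "groups": {"iPlanet": 100}},
    "sys2": {"capacity": 300, "up": True, "groups": {"Sybase": 125}},
    "sys3": {"capacity": 300, "up": True, "groups": {"Oracle 8i": 150, "NFS1": 75}},
    "sys4": {"capacity": 150, "up": True, "groups": {"NFS2": 75, "NFS3": 75}},
}

# Oracle 8i faults on sys3 and moves to the system with 200 available.
t1 = fail_over(systems, "Oracle 8i", "sys3")

# The NFS server (sys4) fails; its groups move one at a time.
systems["sys4"]["up"] = False
t2 = fail_over(systems, "NFS2", "sys4")
t3 = fail_over(systems, "NFS3", "sys4")

print(t1, t2, t3)
for name in ("sys1", "sys2", "sys3"):
    print(name, available(systems[name]))
```

Because AvailableCapacity is recalculated after each move, NFS2 and NFS3 end up on different systems even though they come from the same failed server.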
Setting Load and Capacity

The Load and Capacity attributes are user-defined values.
Set attributes using the hagrp and hasys commands.
Examples:
hasys -modify LgSrv1 Capacity 300
hagrp -modify OracleSG Load 150
AvailableCapacity is calculated by VCS:
Capacity minus Load equals AvailableCapacity
Dynamic Load Balancing

1. External software monitors CPU utilization (30, 40, 75, and 80 percent utilization for the systems shown below).
2. The software sets the DynamicLoad attribute according to the system Capacity value using hasys -load system value.
3. For example, if CPU utilization is 30% and Capacity is set to 300, set DynamicLoad to 90 (30% of 300).

CPU    Service groups (Load)              Capacity   DynLoad   Available
30%    iPlanet (100)                      300        90        210
40%    Sybase (125)                       300        120       180
75%    Oracle 8i (150), NFS1 (75)         300        225       75
80%    Process (40), Process (40)         100        80        20
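A monitoring script could derive the DynamicLoad value from CPU utilization as described in this slide. The sketch below only builds the command string; the function names are hypothetical, and it assumes the hasys -load system value form of the command.

```python
def dynamic_load(cpu_percent, capacity):
    """Scale CPU utilization (percent) to the system's Capacity units."""
    return round(capacity * cpu_percent / 100)


def hasys_load_command(system, cpu_percent, capacity):
    """Build the command an external monitor might run (not executed here)."""
    return f"hasys -load {system} {dynamic_load(cpu_percent, capacity)}"


# Utilization and Capacity values from the slide.
for cpu, capacity in ((30, 300), (40, 300), (75, 300), (80, 100)):
    load = dynamic_load(cpu, capacity)
    print(f"{cpu}% of {capacity}: DynamicLoad={load}, Available={capacity - load}")

print(hasys_load_command("Svr1", 30, 300))
```

In practice the external software would run the generated command periodically so that AvailableCapacity tracks real utilization.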
The LoadWarning Trigger

1. Runs when a system has been running at a specified percent of its Capacity level for a specified period of time.
2. Configured by placing the loadwarning script in /opt/VRTSvcs/bin/triggers and setting system attributes.
3. This example configuration causes VCS to run the trigger if system Svr4 runs at 90 percent of capacity for ten minutes.

main.cf:
system Svr4 (
    Capacity=100
    LoadWarningLevel=90
    LoadTimeThreshold=600
    )

Diagram: the dynamic load example, with Svr4 at 80 percent utilization (Capacity 100, DynLoad 80, Available 20).
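The LoadWarning condition (load at or above LoadWarningLevel percent of Capacity for LoadTimeThreshold seconds) can be modeled as below. This is a simplified sketch; how VCS actually samples load is not described in the slide, so the sample list and function name are assumptions.

```python
def loadwarning_due(samples, capacity, warning_level=90, time_threshold=600):
    """Return True if load stayed at or above the warning level long enough.

    samples is a time-ordered list of (seconds, load) pairs.
    """
    limit = capacity * warning_level / 100
    start = None  # time at which the current high-load streak began
    for seconds, load in samples:
        if load >= limit:
            if start is None:
                start = seconds
            if seconds - start >= time_threshold:
                return True
        else:
            start = None  # streak broken, start counting again
    return False


# Capacity=100, LoadWarningLevel=90, LoadTimeThreshold=600, as in the example.
sustained = [(0, 95), (300, 92), (600, 91)]
interrupted = [(0, 95), (300, 80), (600, 95), (900, 95)]
print(loadwarning_due(sustained, capacity=100))
print(loadwarning_due(interrupted, capacity=100))
```

The second sample list never triggers because the dip at 300 seconds resets the timer.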
System Limits
1. Define system Limits based on the server properties:
   Limits = {Processors=4, Mem=512}
Three of the servers each have Processors=4 and Mem=512 (Limits 4,512).
The fourth server has Processors=1 and Mem=128 (Limits 1,128).
For each system, Limits minus Prereq equals Current.
Service Group Prerequisites

1. Define system Limits based on the server properties.
2. Define service group Prerequisites based on application requirements:
   iPlanet requires 1 Processor and 184 MB RAM.
   Sybase requires 1 Processor and 212 MB RAM.
   Oracle 8i requires 2 Processors and 256 MB RAM.
   Each NFS group requires 1 Processor and 48 MB RAM.

Values are shown as Processors,Mem.
Service groups on server            Limits   Prereq   Current
iPlanet (1,184)                     4,512    1,184    3,328
Sybase (1,212)                      4,512    1,212    3,300
Oracle 8i (2,256), NFS1 (1,48)      4,512    3,304    1,208
NFS2 (1,48), NFS3 (1,48)            1,128    1,96     1,32
Combining Capacity and Limits

When used together, VCS determines the failover target as follows:
Limits and Prerequisites are used to determine a subset of potential failover targets.
Of this subset, the system with the highest value for AvailableCapacity is selected.
If multiple systems have the same AvailableCapacity, the first system in SystemList is selected.
Limits are hard values: if a system does not meet the Prerequisites, the service group cannot be started on that system.
Capacity is a soft limit: the system with the highest AvailableCapacity is selected, even if AvailableCapacity results in a negative number.
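The two-stage selection described here (hard Limits first, then soft Capacity) can be sketched as follows. The system names, dictionary layout, and sample values are hypothetical; the code is an illustration of the rules as stated on the slide rather than the VCS algorithm itself.

```python
def meets_prerequisites(current_limits, prerequisites):
    """Hard check: every prerequisite must fit within the remaining limits."""
    return all(current_limits.get(key, 0) >= need for key, need in prerequisites.items())


def choose_target(system_list, systems, prerequisites):
    """Filter by Limits and Prerequisites, then pick the highest AvailableCapacity."""
    eligible = [
        name for name in system_list
        if meets_prerequisites(systems[name]["current_limits"], prerequisites)
    ]
    if not eligible:
        return None  # no system satisfies the hard limits
    best = max(systems[name]["available_capacity"] for name in eligible)
    # On a tie, the first matching system in SystemList order is selected.
    return next(n for n in eligible if systems[n]["available_capacity"] == best)


systems = {
    "sysA": {"current_limits": {"Processors": 3, "Mem": 328}, "available_capacity": 200},
    "sysB": {"current_limits": {"Processors": 3, "Mem": 300}, "available_capacity": 175},
    "sysC": {"current_limits": {"Processors": 1, "Mem": 208}, "available_capacity": 250},
}
oracle_prereq = {"Processors": 2, "Mem": 256}
print(choose_target(["sysA", "sysB", "sysC"], systems, oracle_prereq))
```

Here sysC has the most AvailableCapacity but is excluded because it cannot supply two processors, so the choice falls to the best of the remaining systems.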
Failover Zones
Diagram: systems sysa, sysb, sysc, sysd, syse, and sysf are divided into a preferred failover zone for the Database service group and a preferred failover zone for the Web service group.
The SystemList for both service groups includes all systems in the
cluster.
SystemZones Attribute
Used to define the preferred failover zones for each service group.
If the service group is online in a system zone, it fails over to other
systems in the same zone based on the FailOverPolicy, until
there are no systems available in that zone.
When there are no other systems for failover in the same zone,
VCS chooses a system in a new zone from the SystemList based
on the FailOverPolicy.
To define SystemZones:
Syntax:
hagrp -modify group_name SystemZones \
sys1 zone# sys2 zone# sys zone#
Example:
hagrp -modify OracleSG SystemZones sysa \
0 sysb 0 sysc 1 sysd 1 syse 1 sysf 1
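The zone preference can be pictured as a filter applied before the FailOverPolicy chooses among the remaining systems. The sketch below uses the zone numbers from the OracleSG example; the function name and the list of available systems are assumptions made for illustration.

```python
def zone_candidates(system_zones, current_system, available_systems):
    """Prefer systems in the same zone; fall back to all other available systems."""
    zone = system_zones[current_system]
    same_zone = [
        s for s in available_systems
        if s != current_system and system_zones.get(s) == zone
    ]
    if same_zone:
        return same_zone
    return [s for s in available_systems if s != current_system]


# Zone assignments from the SystemZones example.
zones = {"sysa": 0, "sysb": 0, "sysc": 1, "sysd": 1, "syse": 1, "sysf": 1}

# sysb is still available, so the group stays within zone 0.
print(zone_candidates(zones, "sysa", ["sysb", "sysc", "sysd"]))

# No other zone 0 system is available, so systems in other zones are considered.
print(zone_candidates(zones, "sysa", ["sysc", "sysd"]))
```

The returned list would then be passed to whatever FailOverPolicy applies to the service group.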
Controlling Failover Behavior with Resource Type Attributes
RestartLimit
   Affects how the agent responds to a resource fault
   Default: 0
ConfInterval
   Determines the amount of time that a tolerance or restart counter can be incremented
   Default: 600 seconds
ToleranceLimit
   Enables the monitor entry point to return OFFLINE several times before the resource is declared FAULTED
   Default: 0
Restart Example
RestartLimit=1
   The resource is restarted one time within the ConfInterval timeframe.
ConfInterval=180
   The resource can be restarted once within a three-minute interval.
MonitorInterval=60 seconds (default value)
   The resource is monitored every 60 seconds.
Timeline (monitored every MonitorInterval, within one ConfInterval): Online, Online, Offline, Restart, Online, Offline, Faulted
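The interaction of RestartLimit and ConfInterval in the restart example can be approximated with a small simulation. This is a simplified sketch: it assumes every restart succeeds immediately and that the restart counter resets once the resource has stayed online for ConfInterval seconds; the function name is hypothetical.

```python
def agent_responses(offline_times, restart_limit=1, conf_interval=180):
    """Return the agent's response to each unexpected OFFLINE report.

    offline_times lists the times (in seconds) at which monitor finds the
    resource unexpectedly offline.
    """
    responses = []
    restarts = 0
    last_restart = None
    for now in offline_times:
        # The counter is cleared after the resource stays online for ConfInterval.
        if last_restart is not None and now - last_restart >= conf_interval:
            restarts = 0
        if restarts < restart_limit:
            restarts += 1
            last_restart = now
            responses.append("restart")
        else:
            responses.append("faulted")
            break  # the resource is now FAULTED; failover handling takes over
    return responses


# Two failures 60 seconds apart: one restart, then FAULTED.
print(agent_responses([60, 120]))

# Second failure after more than ConfInterval: the resource is restarted again.
print(agent_responses([60, 300]))
```

The first call mirrors the timeline on the slide, where the second offline report inside the same ConfInterval leads to a fault.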
Adjusting Monitoring
MonitorInterval:
Default value is 60 seconds for most resource
types.
Consider reducing to 10 or 20 seconds for testing.
Use caution when changing this value:
Load is increased on cluster systems.
Resources can fault if they cannot respond in the
interval specified.
OfflineMonitorInterval:
Default is 300 seconds for most resource types.
Consider reducing to 60 seconds for testing.
Modifying Resource Type Attributes
Can be used to optimize agents
Applied to all resources of the specified type
Command line example:
hatype -modify FileOnOff MonitorInterval 5
Preventing Failover
A frozen service group does not fail over when a critical resource faults.
The service group must be unfrozen to enable failover.
To freeze a service group:
hagrp -freeze service_group [-persistent]
To unfreeze a service group:
hagrp -unfreeze service_group [-persistent]
A persistent freeze:
Requires the cluster configuration to be open
Remains in effect even if VCS is stopped and restarted throughout the cluster
Clearing Faults
Verify that the faulted resource is offline.
Fix the problem that caused the fault and clean
up any residual effects.
To clear a fault, type:
hares -clear resource_name [-sys system_name]
To clear all faults in a service group, type:

hagrp -clear group_name [-sys system_name]
Persistent resources are cleared by probing:

hares -probe resource_name -sys system_name
Probing Resources
Causes VCS to immediately monitor the resource
To probe a resource, type:
hares -probe resource_name -sys system_name
You can clear a persistent resource by probing it after the underlying problem has been fixed.
Flushing Service Groups

All online/offline agent processes are stopped.
All resources in transitional states waiting to go online
are taken offline.
Propagation of the offline operation is stopped, but
resources waiting to go offline remain in the
transitional state.
You must verify the physical or software resources are
stopped at the operating system level after flushing to
avoid creating a concurrency violation.
To flush a service group, type:
hagrp -flush group_name -sys system_name
Testing Failover
Use test resources, such as FileOnOff, when
applicable.
Set lower values for MonitorInterval,
OfflineMonitorInterval, and ConfInterval to detect
faults more quickly.
Manually online, offline, and switch the service group
among all systems.
Simulate failure of each resource in the service
group.
Simulate failover of the entire system.
Testing Examples
Force a resource to fault.
Reboot a system.
Halt and reboot a system.
Remove power from a system.
Summary
You should now be able to:
Describe how VCS responds to faults.
Implement failover policies.
Set limits and prerequisites.
Use system zones to control failover.
Control failover behavior using attributes.
Clear faults.
Probe resources.
Flush service groups.
Test failover.
Lab 10: Faults and Failovers

Diagram: Student Red and Student Blue work with the service groups RedNFSSG and BlueNFSSG.
Triggers: resfault, nofailover, sysoffline
