
Problem

Ways to prevent and reduce the effects of split-brain in VERITAS Cluster Server for UNIX

Solution

This document discusses split-brain and describes the current and future options provided by
VERITAS Cluster Server (VCS) to prevent it. Additional considerations for limiting
the effects of split-brain once it occurs are also covered.

What is split-brain? The following discussion is taken from the VCS 3.5 User's Guide.

Network Partitions and Split-Brain


Under normal conditions, when a VCS system ceases heartbeat communication with its peers
due to an event such as power loss or a system crash, the peers assume the system has failed and
issue a new, "regular" membership excluding the departed system. A designated system in the
cluster then takes over the service groups running on the departed system, ensuring the
application remains highly available.

However, heartbeats can also fail due to network failures. If all network connections between
any two groups of systems fail simultaneously, a network partition occurs. When this happens,
systems on both sides of the partition can restart applications from the other side, resulting in
duplicate services, or "split-brain". A split-brain occurs when two independent systems
configured in a cluster assume they have exclusive access to a given resource (usually a file
system or volume). The most serious problem caused by a network partition is that it affects the
data on shared disks.

All failover management software uses a predefined method to determine whether its peer is
"alive". If the peer is alive, the system recognizes it cannot safely take over resources.
Split-brain occurs when the method of determining peer failure is compromised. In virtually all
failover management software (FMS) systems, split-brain situations are rare. A true split-brain
means multiple systems are online and have accessed an exclusive resource simultaneously.

Note: Splitting communications between cluster nodes does not by itself constitute a split-brain.
A split-brain means cluster membership was affected in such a way that multiple systems use the
same exclusive resources, usually resulting in data corruption. The goal is to minimize the
chance of a system taking over an exclusive resource while another has it active, yet still
accommodate a system powering off. In other words, VCS needs a way to discriminate between a
system that has failed and one that is simply not communicating.

How VCS Avoids Split Brain


VCS uses heartbeats to determine the "health" of its peers. These can be private network
heartbeats, public (low-priority) heartbeats, and disk heartbeats. Regardless of the heartbeat
configuration, VCS determines that a system has faulted (due to power loss, kernel panic, etc.)
when all heartbeats fail simultaneously. For this method to work, the system must have two or
more functioning heartbeats and all must fail simultaneously. For VCS to encounter split brain,
the following events must occur:
• A service group must be online on a system in a cluster.
• The service group must have a system (or systems) designated in its SystemList
attribute as a potential failover target.
• All heartbeat communication between the system with the online service group
and the system designated as the potential takeover target must fail simultaneously
while the original system stays online.
• The potential takeover target must actually bring online resources that are
typically exclusive, ownership-type items, such as disk groups, volumes, or file
systems.

Jeopardy Defined
The design of VCS requires that a minimum of two heartbeat-capable channels be available
between nodes to protect against network failure. When a node is missing a single heartbeat
connection, VCS can no longer discriminate between a system loss and a loss of the last network
connection. It must then handle loss of communications on a single network differently from loss
on multiple networks. This procedure is called "jeopardy." Low latency transport (LLT)
notifies global atomic broadcast (GAB) of reliable versus unreliable network
communications. GAB uses this information, with or without
a functional disk heartbeat, to determine cluster membership. If the system heartbeats are lost
simultaneously across all channels, VCS determines the system has failed. The services running
on that system are then restarted on another. However, if the node was running with one
heartbeat only (in jeopardy) prior to the loss of a heartbeat, VCS does not restart the applications
on a new node. This action of disabling failover is a safety mechanism that prevents data
corruption.
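The behavior described above can be summarized as a simple decision rule. The sketch below is purely illustrative (it is not VCS code, and the function name is invented); it maps the number of functioning heartbeat links to a peer onto the three states described in the text:

```shell
#!/bin/sh
# Illustrative only: maps remaining heartbeat links to the behavior
# described above. peer_state is a hypothetical name, not a VCS command.
peer_state() {
    links=$1    # number of functioning heartbeat links to the peer
    case "$links" in
        0) echo "FAULTED"  ;;  # all heartbeats lost at once: peer declared failed, failover proceeds
        1) echo "JEOPARDY" ;;  # one link left: failover of the peer's groups is disabled
        *) echo "RUNNING"  ;;  # two or more links: normal membership
    esac
}
```

Note that with two private links plus a low-priority public link, any single failure still leaves two functioning links, so the cluster never enters jeopardy on a single fault.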

Split-Brain Prevention
What can be done to avoid split-brain? VCS provides a number of functions aimed at the
prevention of split-brain situations. The following list contains a brief explanation of each
prevention method.

Private Heartbeat - VERITAS recommends a minimum of two dedicated 100 megabit private
links between cluster nodes. These must be completely isolated from each other so the failure of
one heartbeat link cannot possibly affect the other.

Configuring private heartbeats to share any infrastructure is not recommended. Configurations
such as running two shared heartbeats to the same hub or switch, or using a single virtual local
area network (VLAN) to trunk between two switches, introduce a single point of failure in the
heartbeat architecture. The simplest guideline is: no single failure, such as power, network
equipment, or cabling, can disable both heartbeat connections.
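As a concrete illustration, two fully independent private links are defined in /etc/llttab. The node name, cluster ID, and interface names below are examples only; actual values depend on your hardware and configuration:

```shell
# Example /etc/llttab (Solaris) -- names and IDs are illustrative
set-node node01                      # this system's name in the cluster
set-cluster 7                        # cluster ID, unique per cluster
link qfe0 /dev/qfe:0 - ether - -     # private link 1, on its own hub/switch
link qfe1 /dev/qfe:1 - ether - -     # private link 2, on separate infrastructure
```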

Low-Priority Heartbeat - A heartbeat over the public network that carries minimal traffic
until only one normal heartbeat remains, at which point it is promoted to a fully functional
heartbeat.

Use of a low-priority link is also recommended to provide further redundancy. It prevents a
jeopardy condition on the loss of any single private link (consider a low-priority heartbeat
along with two private network heartbeats).
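In /etc/llttab, a low-priority heartbeat is declared with the link-lowpri directive instead of link. The interface name below is an example only:

```shell
# Added to /etc/llttab alongside the two private "link" lines
link-lowpri hme0 /dev/hme:0 - ether - -   # heartbeat over the public network
```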

Disk Heartbeat - With disk heartbeating configured, each system in the cluster periodically
writes to and reads from specific regions on a dedicated shared disk. This exchange consists of
heartbeating only, and does not include communication about cluster status.

With disk heartbeating configured in addition to the private network connections, VCS has
multiple heartbeat paths available. For example, if one of two private network connections fails,
VCS has the remaining network connection and the disk heartbeat region that allow heartbeats to
continue normally.
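As a sketch, a disk heartbeat region might be added with the gabdiskhb command. The device path and options below are illustrative; the exact syntax varies by VCS release, so verify it against the documentation for your version:

```shell
# Illustrative only -- verify options against your VCS release.
# Add a GAB heartbeat region on a dedicated shared disk slice:
/sbin/gabdiskhb -a /dev/dsk/c1t2d0s3 -s 16 -p a
```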

Service Group Heartbeats - Disk heartbeats that are checked before a service group is brought
online.
This is designed to further assist in preventing data corruption. If, for some reason, a
system comes up and prepares to take over a service group, a service group heartbeat configured
at the bottom of the dependency tree first checks whether any other system is writing to the disk.
The local system, via the ServiceGroupHB agent, tries to obtain "ownership" of the available disks
as specified by the Disks attribute. The system gains ownership of a disk when it determines that
the disk is available and not owned by another system.
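A hedged sketch of how this might appear in main.cf follows; the resource name and device path are invented for illustration, and the service group's exclusive resources would depend on this resource so the ownership check runs first:

```
// Hypothetical main.cf fragment -- names and paths are examples
ServiceGroupHB sghb_app (
        Disks = { "/dev/rdsk/c1t1d0s0" }
        )
// Placed at the bottom of the dependency tree, e.g.:
// app_dg requires sghb_app
```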

SCSI II Disk Reservations - When using the DiskReservation agent, VCS reserves and monitors
SCSI disks for a system, enabling a resource to go online on that system. The agent supports
all SCSI II disks. Use this agent to specify a list of raw disk devices, and reserve all or a
percentage of accessible disks for an application. The reservation prevents disk data corruption
by restricting other systems from accessing and writing to the disks. An automatic probing
feature allows systems to maintain reservations even when the disks or bus are reset. The
optional FailFast feature minimizes data corruption in the event of a reservation conflict by
causing the system to panic.
Note: The DiskReservation agent is supported on Solaris 2.7 and above. The agent is not
supported with dynamic multipathing software, such as VERITAS DMP.
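A hedged main.cf sketch of the agent follows; the resource name and device paths are invented for illustration:

```
// Hypothetical main.cf fragment -- names and paths are examples
DiskReservation dr_app (
        Disks = { "/dev/rdsk/c1t0d0s2", "/dev/rdsk/c1t1d0s2" }
        FailFast = 1    // panic this system on a reservation conflict
        )
```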

IP Checking - This method uses either the preonline-ipc event trigger, or simply makes an
IP resource the first resource to come online in the service group. Both methods check that
the IP addresses for the service group are not in use by another system before onlining the
service group.
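The second approach can be sketched in main.cf by making every exclusive resource depend on the IP resource. All names, the address, and the paths below are invented for illustration:

```
// Hypothetical main.cf fragment -- names, address, and paths are examples
IP app_ip (
        Device = qfe0
        Address = "192.168.10.100"
        )
Mount app_mnt (
        MountPoint = "/app"
        BlockDevice = "/dev/vx/dsk/appdg/appvol"
        FSType = vxfs
        )
// The mount onlines only after the IP onlines; if the address is
// already in use elsewhere, the IP resource faults and the group stops.
app_mnt requires app_ip
```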

Auto Disabling Service Groups - (non-configurable) When VCS does not know the status of a
service group on a particular system, it autodisables the service group on that system.
Autodisabling occurs under the following conditions:

• When the VCS engine, HAD, is not running on the system.
• When not all resources within the service group have been probed on the system.
• When a particular system is visible through disk heartbeat only.
Under these conditions, all service groups that include the system in their SystemList attribute
are autodisabled. This does not apply to systems that are powered off.

When the VCS process (HAD) is killed on a system, other systems in the cluster mark as
autodisabled all service groups capable of going online on that system. The AutoDisabled flag is
cleared when the system goes offline from the cluster. As long as the system goes offline within
the interval specified in the ShutdownTimeout value, VCS treats this as a system reboot.

I/O Fencing SCSI III Reservations - I/O Fencing (VxFEN) is scheduled to be included in the
VCS 4.0 version. VCS can have parallel or failover service groups with disk group resources in
them. If the cluster has a split-brain, VxFEN should force one of the subclusters to commit
suicide in order to prevent data corruption. The subcluster which commits suicide should never
gain access to the disk groups without joining the cluster again. In parallel service groups, it is
necessary to prevent any active processes from writing to the disks. In failover groups, however,
access to the disk only needs to be prevented when VCS fails over the service group to another
node. Some multipathing products will be supported with I/O Fencing.

Minimizing the Effects of Split-Brain


In addition to avoiding split-brain, there are utilities in place to help minimize its effects
should it still occur. The concurrency violation trigger script and the -j option to the
gabconfig command are described below.

Concurrency Violation Trigger Script - The violation trigger script offlines a failover
service group that has resources online on more than one node at a time. The trigger is invoked
when a resource of a failover service group is online on more than one node. This can happen
when a resource goes online by itself while already online (through VCS) on another node.
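The core action of such a trigger can be sketched as follows. This is a hypothetical illustration, not the shipped script; it assumes HAD passes the system and service group names as arguments and that VCS commands live in /opt/VRTSvcs/bin:

```shell
#!/bin/sh
# Hypothetical sketch of a violation trigger -- not the shipped script.
SYSTEM=$1     # system where the concurrency violation was detected
GROUP=$2      # failover service group online on more than one node
# Take the group offline on that system to end the concurrency:
/opt/VRTSvcs/bin/hagrp -offline "$GROUP" -sys "$SYSTEM"
```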

Gabconfig -j - If a network partition occurs, a cluster can "split" into two or more separate
mini-clusters. When two clusters join as one, VCS designates that one system be ejected. GAB
prints diagnostic messages and sends iofence messages to the system being ejected. The system
receiving the iofence messages tries to kill the client process. If the -j option is used in
gabconfig, the system is halted when the iofence message is received.
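For illustration, the option is typically added to the gabconfig invocation in /etc/gabtab; the membership count below is an example for a two-node cluster:

```shell
# Example /etc/gabtab entry -- illustrative; -n matches your node count.
# With -j, a node halts on receiving an iofence message instead of
# only attempting to kill the client process.
/sbin/gabconfig -c -n 2 -j
```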
