Ways to prevent and reduce the effects of split-brain in VERITAS Cluster Server for UNIX
Solution
This document describes split-brain and outlines the current and planned options that
VERITAS Cluster Server (VCS) provides to prevent it. It also mentions additional measures for
limiting the effects of a split-brain once one has occurred.
What is split-brain? The following discussion of split-brain is taken from the VCS 3.5
User's Guide.
Note: A break in communications between cluster nodes does not by itself constitute a
split-brain. A split-brain means cluster membership was affected in such a way that multiple
systems use the same exclusive resources, usually resulting in data corruption. The goal is to
minimize the chance of a system taking over an exclusive resource while another has it active,
yet still accommodate a system powering off. In other words, VCS needs a way to discriminate
between a system that has failed and one that is simply not communicating.
Jeopardy Defined
The design of VCS requires that a minimum of two heartbeat-capable channels be available
between nodes to protect against network failure. When a node is down to a single heartbeat
connection, VCS can no longer discriminate between a system loss and a loss of the last network
connection. It must therefore handle loss of communications on a single network differently
from loss on multiple networks. This handling is called "jeopardy." Low latency transport (LLT)
notifies global atomic broadcast (GAB) whether network communications are reliable or
unreliable. GAB uses this information, with or without a functional disk heartbeat, to delegate
cluster membership. If the system heartbeats are lost simultaneously across all channels, VCS
determines the system has failed. The services running on that system are then restarted on
another system. However, if the node was running with only one heartbeat (in jeopardy) prior to
the loss of that heartbeat, VCS does not restart the applications on a new node. Disabling
failover in this case is a safety mechanism that prevents data corruption.
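The jeopardy rules above can be sketched as a small decision function. This is an illustrative Python model only, not VCS code: the function name, the Action values, and the link-count inputs are assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    NO_CHANGE = "still redundant, no change"
    JEOPARDY = "one link left, membership is in jeopardy"
    FAIL_OVER = "restart services on another node"
    DISABLE_FAILOVER = "do not fail over"


def on_heartbeat_change(links_before: int, links_after: int) -> Action:
    """Model the reaction to a change in working heartbeat links to a peer.

    links_before and links_after are hypothetical inputs: the number of
    working heartbeat channels before and after the event.
    """
    if links_after >= 2:
        return Action.NO_CHANGE
    if links_after == 1:
        # Down to the last link: a further loss cannot be told apart
        # from a system failure.
        return Action.JEOPARDY
    # From here on, no heartbeat link is left.
    if links_before >= 2:
        # All channels lost simultaneously: treat the peer as failed.
        return Action.FAIL_OVER
    # The peer was already in jeopardy: failover is disabled.
    return Action.DISABLE_FAILOVER
```

For example, on_heartbeat_change(2, 0) yields Action.FAIL_OVER, while on_heartbeat_change(1, 0) yields Action.DISABLE_FAILOVER.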
Split-Brain Prevention
What can be done to avoid split-brain? VCS provides a number of functions aimed at the
prevention of split-brain situations. The following list contains a brief explanation of each
prevention method.
Private Heartbeat - VERITAS recommends a minimum of two dedicated 100 megabit private
links between cluster nodes. These must be completely isolated from each other so the failure of
one heartbeat link cannot possibly affect the other.
Low-Priority Heartbeat - A heartbeat link configured over the public network carries minimal
traffic until only one normal heartbeat link remains. It then becomes a fully functional
heartbeat.
Disk Heartbeat - With disk heartbeating configured, each system in the cluster periodically
writes to and reads from specific regions on a dedicated shared disk. This exchange consists of
heartbeating only, and does not include communication about cluster status.
With disk heartbeating configured in addition to the private network connections, VCS has
multiple heartbeat paths available. For example, if one of two private network connections fails,
VCS has the remaining network connection and the disk heartbeat region that allow heartbeats to
continue normally.
Service Group Heartbeats - Disk heartbeats that are checked before a service group is brought
online.
This check is designed to further assist in preventing data corruption. If, for some reason, a
system comes up and prepares to take over a service group, a service group heartbeat configured
at the bottom of the dependency tree first checks whether any other system is writing to the
disk. The local system, via the ServiceGroupHB agent, tries to obtain "ownership" of the
available disks as specified by the Disks attribute. The system gains ownership of a disk when
it determines that the disk is available and not owned by another system.
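The ownership check described above can be illustrated with a minimal Python sketch. It is not the ServiceGroupHB agent: the function name, the in-memory owners map, and the all-or-nothing policy are assumptions chosen to keep the example small.

```python
from typing import Dict, List, Optional


def try_acquire_disks(disks: List[str],
                      owners: Dict[str, Optional[str]],
                      local: str) -> bool:
    """Claim every disk in `disks` for system `local`, or claim none.

    `owners` is a hypothetical map of disk name -> owning system, where
    None means unowned. A disk missing from the map counts as unavailable.
    """
    for disk in disks:
        if disk not in owners:
            return False  # disk is not available
        owner = owners[disk]
        if owner is not None and owner != local:
            return False  # another system owns the disk
    # Every disk is available and unowned (or already ours): take ownership.
    for disk in disks:
        owners[disk] = local
    return True
```

In this model the service group would be brought online only when the function returns True.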
SCSI II Disk Reservations - Reserves and monitors SCSI disks for a system, enabling a
resource to go online on that system, when using the DiskReservation agent. The agent supports
all SCSI II disks. Use this agent to specify a list of raw disk devices, and reserve all or a
percentage of accessible disks for an application. The reservation prevents disk data corruption
by restricting other systems from accessing and writing to the disks. An automatic probing
feature allows systems to maintain reservations even when the disks or bus are reset. The
optional FailFast feature minimizes data corruption in the event of a reservation conflict by
causing the system to panic.
Note: The DiskReservation agent is supported on Solaris 2.7 and above. The agent is not
supported with dynamic multipathing software, such as VERITAS DMP.
IP Checking - This check can be implemented either with the preonline-ipc event trigger or by
making an IP resource the first resource to come online in the service group. Both methods
verify that the IP addresses for the service group are not in use by another system before the
service group is brought online.
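The idea behind IP checking can be sketched as follows. This is a hedged Python illustration, not the trigger shipped with VCS: the function names are made up, and `responds` is a hypothetical probe (for instance, a wrapper around a ping command) supplied by the caller.

```python
from typing import Callable, Iterable, List


def addresses_in_use(addresses: Iterable[str],
                     responds: Callable[[str], bool]) -> List[str]:
    """Return the service group addresses that already answer on the network."""
    return [addr for addr in addresses if responds(addr)]


def may_online(addresses: Iterable[str],
               responds: Callable[[str], bool]) -> bool:
    """Allow the service group online only if none of its addresses is in use."""
    return not addresses_in_use(addresses, responds)
```

Passing the probe in as a parameter keeps the decision logic separate from how an address is actually tested, so the logic can be exercised without touching the network.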
Auto Disabling Service Groups - (non-configurable) When VCS does not know the status of a
service group on a particular system, it autodisables the service group on that system.
Autodisabling occurs under the following conditions:
When the VCS process (HAD) is killed on a system, the other systems in the cluster mark all
service groups capable of going online on that system as autodisabled. The AutoDisabled flag is
cleared when the system goes offline. As long as the system goes offline within the interval
specified in the ShutdownTimeout value, VCS treats this as a system reboot.
I/O Fencing SCSI III Reservations - I/O Fencing (VxFEN) is scheduled to be included in the
VCS 4.0 version. VCS can have parallel or failover service groups with disk group resources in
them. If the cluster has a split-brain, VxFEN should force one of the subclusters to commit
suicide in order to prevent data corruption. The subcluster which commits suicide should never
gain access to the disk groups without joining the cluster again. In parallel service groups, it is
necessary to prevent any active processes from writing to the disks. In failover groups, however,
access to the disk only needs to be prevented when VCS fails over the service group to another
node. Some multipathing products will be supported with I/O Fencing.
Concurrency Violation Trigger Script - The violation trigger script takes offline a failover
service group that has resources online on more than one node at a time. The violation trigger
is invoked when a resource of a failover service group is online on more than one node. This can
happen when a resource goes online by itself while it is already online (through VCS) on another
node.
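The detection step can be modelled with a short Python sketch. It is an illustration under assumptions, not the violation trigger itself: the function name is hypothetical, and the policy of keeping the group on the node where VCS brought it online (`vcs_owner`) is an assumption drawn from the scenario described above.

```python
from typing import Iterable, List


def nodes_to_offline(online_nodes: Iterable[str], vcs_owner: str) -> List[str]:
    """Return the nodes on which a failover group should be taken offline.

    online_nodes: nodes currently reporting the group's resources online.
    vcs_owner: the node where VCS itself brought the group online. It is
    expected to be among online_nodes; if it is not, every online node
    is returned.
    """
    online = sorted(set(online_nodes))
    if len(online) <= 1:
        return []  # at most one node is online: no concurrency violation
    return [node for node in online if node != vcs_owner]
```

For example, with the group online on sysA through VCS and also found online on sysB, the function returns ["sysB"].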
gabconfig -j - If a network partition occurs, a cluster can "split" into two or more separate
mini-clusters. When two such clusters rejoin as one, VCS designates that one system be ejected.
GAB prints diagnostic messages and sends iofence messages to the system being ejected. The
system receiving the iofence messages tries to kill the client process. If the -j option is used
in gabconfig, the system is instead halted when the iofence message is received.
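The difference the -j option makes on the ejected system can be sketched as below. This is a simplified Python model, not GAB code: the function name and the two callbacks are hypothetical, and what happens when the kill attempt fails is left open because the text above does not say.

```python
from typing import Callable


def handle_iofence(halt_on_iofence: bool,
                   kill_client: Callable[[], bool],
                   halt_system: Callable[[], None]) -> str:
    """React to an iofence message on the system being ejected.

    halt_on_iofence models whether the -j option was used.
    kill_client and halt_system are hypothetical callbacks; kill_client
    returns True if the client process was killed.
    """
    if halt_on_iofence:
        halt_system()  # with -j: halt instead of trying to kill the client
        return "halted"
    # Without -j: try to kill the client process.
    return "client killed" if kill_client() else "kill failed"
```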