

High-Availability

HIGH-AVAILABILITY CONTROLLER CONFIGURATION


A high-availability configuration consists of two storage systems (nodes) whose controllers are connected to each other either directly or through switches.

In clustered Data ONTAP, the recommended cluster switches are the NetApp CN1610 and the Cisco Nexus 5596. If your cluster has up to 12 nodes, you can use the NetApp CN1610 switch. If your cluster has more than 12 nodes, the recommended switch is the Cisco Nexus 5596.


The nodes are connected to each other through a cluster adapter or NVRAM adapter, which allows one
node to serve data to the disks on its failed partner node. Each node continually monitors its partner,
mirroring the data in the partner's NVRAM.

HIGH-AVAILABILITY FEATURES
High-availability configurations provide fault tolerance and the ability to perform nondisruptive upgrades
and maintenance.
High-availability configurations provide the following benefits:
Fault tolerance: When one node fails or becomes impaired, a takeover occurs and the partner node
continues to serve the data of the failed node.
Nondisruptive software upgrades: When you halt one node and allow takeover, the partner node
continues to serve data for the halted node, allowing you to upgrade the halted node.
Nondisruptive hardware maintenance: When you halt one node and allow takeover, the partner
node continues to serve data for the halted node, allowing you to replace or repair hardware on the halted
node.

REQUIREMENTS FOR HIGH AVAILABILITY


The number of disks in a standard high-availability configuration must not exceed the maximum
configuration capacity. In addition, the total amount of storage attached to each node must not exceed the
capacity of a single node.
NOTE: When a failover occurs, the takeover node temporarily serves data from all the storage in the
high-availability configuration. When the single-node capacity limit is less than the total high-availability
configuration capacity limit, the total disk space in a cluster can be greater than the single-node capacity
limit. It is acceptable for the takeover node to temporarily serve more than the single-node capacity would
normally allow, as long as it does not own more than the single-node capacity.
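The capacity rule above can be sketched as a small check. This is a conceptual illustration only, not a NetApp tool; the per-node limit below is a hypothetical example value.

```python
# Conceptual sketch of the HA-pair capacity rule: the pair's combined total
# may exceed the single-node limit, but neither node may OWN more storage
# than a single node supports. The limit here is a made-up example in TB.

SINGLE_NODE_LIMIT_TB = 600  # hypothetical single-node maximum capacity

def capacity_ok(node_a_tb, node_b_tb):
    """Each node's owned storage must fit within the single-node limit."""
    return node_a_tb <= SINGLE_NODE_LIMIT_TB and node_b_tb <= SINGLE_NODE_LIMIT_TB

# 400 TB + 500 TB = 900 TB combined exceeds 600 TB, yet the pair is valid
# because neither node individually owns more than the single-node limit.
print(capacity_ok(400, 500))  # True
print(capacity_ok(700, 100))  # False: node A alone exceeds the limit
```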
Disks and disk-shelf compatibility
Both Fibre Channel (FC) and SATA storage are supported in standard high-availability configurations, as
long as the two storage types are not mixed on the same loop.
If needed, a node can have only FC storage while the partner node has only SATA storage.
Cluster interconnect adapters and cables must be installed.
Nodes must be attached to the same network and the network interface cards must be configured
correctly.
System features such as CIFS, NFS, or SyncMirror software must be licensed and enabled on both
nodes.

PARTNER COMMUNICATION
To ensure that both nodes in a high-availability controller configuration maintain the correct and current
status of the partner node, heartbeat information and node status are stored on each node in the mailbox
disks. The mailbox disks are a redundant set of disks used in coordinating takeover or giveback
operations. If one node stops functioning, the surviving partner node uses the information on the mailbox
disks to perform takeover processing, which creates a virtual storage system. In the event of an
interconnect failure, the mailbox heartbeat information prevents an unnecessary failover from occurring.
Moreover, if cluster configuration information that is stored on the mailbox disks is out of sync during
boot, the high-availability controller nodes automatically resolve the situation. The FAS system failover
process is extremely robust, preventing split-brain issues from occurring.
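The role of the mailbox disks in preventing split-brain can be sketched as a decision rule: the surviving node consults both the interconnect heartbeat and the mailbox-disk heartbeat before acting. This is a hypothetical simplification of the behavior described above, not Data ONTAP source code.

```python
# Conceptual sketch: takeover only proceeds when the partner is silent on
# BOTH channels. If the mailbox disks still carry a fresh partner heartbeat,
# the partner is alive and only the interconnect has failed, so no failover.

def should_take_over(interconnect_heartbeat_ok, mailbox_heartbeat_ok):
    """Return True only when the partner appears dead on both channels."""
    return not interconnect_heartbeat_ok and not mailbox_heartbeat_ok

print(should_take_over(False, True))   # False: interconnect down, partner alive
print(should_take_over(False, False))  # True: partner really gone, take over
print(should_take_over(True, True))    # False: normal operation
```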

HIGH-AVAILABILITY CONTROLLERS AND NVRAM


Data ONTAP uses the WAFL file system to manage data processing and NVRAM to guarantee data
consistency before committing writes to disks. If the storage controller experiences a power failure, the
most current data is protected by the NVRAM, and file system integrity is maintained.
In the high-availability controller environment, each node reserves half of the total NVRAM size for the
partner node's data to ensure that exactly the same data exists in NVRAM on both storage controllers.
Therefore, only half of the NVRAM in the high-availability controller is dedicated to the local node. If
failover occurs, when the surviving node takes over the failed node, all WAFL checkpoints stored in
NVRAM are flushed to disk. The surviving node then combines the split NVRAM.
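The NVRAM split described above can be illustrated with a small sketch. The size is an arbitrary example value; this is a conceptual model of the layout, not an actual Data ONTAP structure.

```python
# Conceptual sketch: in normal operation each node dedicates half its NVRAM
# to itself and mirrors the partner's data into the other half. After a
# takeover, the survivor flushes the mirrored checkpoints to disk and
# combines the split NVRAM for its own use.

NVRAM_SIZE_MB = 1024  # hypothetical example size

def nvram_layout(in_takeover):
    if in_takeover:
        # partner half flushed to disk; survivor reclaims the full NVRAM
        return {"local": NVRAM_SIZE_MB, "partner_mirror": 0}
    return {"local": NVRAM_SIZE_MB // 2, "partner_mirror": NVRAM_SIZE_MB // 2}

print(nvram_layout(False))  # {'local': 512, 'partner_mirror': 512}
print(nvram_layout(True))   # {'local': 1024, 'partner_mirror': 0}
```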
How the Interconnect Works
The interconnect adapters are a critical component in the high-availability controller configuration. Data
ONTAP uses these adapters to transfer system data between the partner nodes, which maintain data
synchronization in the NVRAM on both controllers. Other critical information is also exchanged through
the interconnect adapters, including the heartbeat signal, system time, and details about temporary disk
unavailability due to pending disk-firmware updates.

CONFIGURING HIGH AVAILABILITY


To add the license, enter the following command on both node consoles for each required license:
license add xxxxxxx
where xxxxxxx is the license code that you received for the feature
To reboot both nodes, enter the following command:
reboot
To enable the license, enter the following command on the local node console:
cf enable
To verify that controller failover is enabled, enter the following command on each node console:
cf status

SETTING MATCHING NODE OPTIONS


Because some Data ONTAP options need to be the same on both the local and partner node, you need to
check these options with the options command on each node and change them as necessary.
STEPS
1. View and note the values of the options on the local and partner nodes, using the following command
on each console:
options
The current option settings for the node are displayed on the console. Output similar to the following is
displayed:
autosupport.doit TEST
autosupport.enable on
2. Verify that the options with comments in parentheses are set to the same value for both nodes. The
comments are as follows:
Value might be overwritten in takeover
Same value required in local+partner

Same value in local+partner recommended

3. Correct any mismatched options using the following command:


options option_name option_value
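The comparison in step 2 can be sketched programmatically. This is a conceptual helper for illustration, not a Data ONTAP command; the option names and values below are examples only.

```python
# Conceptual sketch: given the `options` output captured from each console
# as name/value pairs, report the options whose values differ and therefore
# need correcting with `options option_name option_value`.

local   = {"autosupport.enable": "on",  "cf.takeover.on_panic": "on"}
partner = {"autosupport.enable": "off", "cf.takeover.on_panic": "on"}

def mismatched(local_opts, partner_opts):
    """Return the names of options whose values differ between the nodes."""
    return sorted(name for name in local_opts
                  if partner_opts.get(name) != local_opts[name])

print(mismatched(local, partner))  # ['autosupport.enable']
```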

TAKEOVER OPERATION
When a takeover occurs, the functioning partner node takes over the functions and disk drives of the
failed node by creating an emulated storage system that:
Assumes the identity of the failed node
Accesses the failed node's disks and serves its data to clients
The partner node maintains its own identity and its own primary functions, but also handles the added
functionality of the failed node through the emulated node.

GIVEBACK OPERATION
After a partner node is repaired and operating normally, you can use the cf giveback command to
return operations to the partner.
When the failed node is functioning again, the following events can occur:
You initiate a cf giveback command that terminates the emulated node on the partner.
The failed node resumes normal operation, serving its own data.
The high-availability configuration resumes normal operation, with each node ready to take over for
its partner if the partner fails.


NEGOTIATED FAILOVER
To enable negotiated failover in the event of a failed network interface, you must explicitly enable the
cf.takeover.on_network_interface_failure option, set the failover policy, and mark
each interface that can trigger a negotiated failover (NFO).
NOTE: The cf.takeover.on_network_interface_failure.policy option must be set
manually on each controller in a high-availability pair: all_nics means ALL interfaces marked for
failover must fail before takeover occurs; any_nic means ANY interface marked for failover triggers a
high-availability takeover. The cf.takeover.on_network_interface_failure option
is not the first line of defense against a network switch being a single point of failure, and should be
considered only when a single-mode vif or second-level vif cannot be used. Controller failover is
disruptive to CIFS clients and can be disruptive to NFS clients using soft mounts, whereas vif failover is
completely nondisruptive and is therefore the preferred method. Negotiated failover is nonetheless used
increasingly in MultiStore environments.
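The two failover policies can be sketched as a small evaluation function. This is a hypothetical illustration of the documented all_nics/any_nic semantics, not the actual option-processing code.

```python
# Conceptual sketch of the negotiated-failover (NFO) policy evaluation:
# 'any_nic'  -> takeover when ANY interface marked for failover has failed.
# 'all_nics' -> takeover only when ALL marked interfaces have failed.

def nfo_triggers_takeover(policy, failed_flags):
    """failed_flags: one bool per interface marked for negotiated failover."""
    if policy == "any_nic":
        return any(failed_flags)
    if policy == "all_nics":
        return all(failed_flags)
    raise ValueError("unknown policy")

print(nfo_triggers_takeover("any_nic",  [True, False]))  # True
print(nfo_triggers_takeover("all_nics", [True, False]))  # False
print(nfo_triggers_takeover("all_nics", [True, True]))   # True
```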

BEST PRACTICES
General best practices require comprehensive testing of all mission-critical systems before introducing
them into a production environment. High-availability controller testing should include functional testing
of takeover and giveback as well as performance evaluation. Extensive testing validates planning.
Monitor network connectivity and stability.
Unstable networks not only affect total takeover and giveback times, they adversely affect all devices on
the network in various ways. NetApp storage controllers are typically connected to the network to serve
data, so if the network is unstable, the first symptom is degradation of storage-controller performance and
availability. Client service requests are retransmitted many times before reaching the storage controller,
appearing to the client as slow responses from the storage controller. In a worst-case scenario, an unstable
network can cause communication to time-out, and the storage controller appears to be unavailable.
During takeover and giveback operations in the high-availability controller environment, storage
controllers attempt to connect to numerous types of servers on the network, including Windows domain
controllers, DNS, NIS, LDAP, and application servers. If these systems are unavailable or the network is
unstable, the storage controller continues to retry establishing communications, which delays takeover or
giveback times.

What an HA pair is
An HA pair is two storage systems (nodes) whose controllers are connected to each other either
directly or, in the case of a fabric-attached MetroCluster, through switches and FC-VI interconnect
adapters. In this configuration, one node can take over its partner's storage to provide continued data
service if the partner goes down.


You can configure the HA pair so that each node in the pair shares access to a common set of
storage, subnets, and tape drives, or each node can own its own distinct set of storage.
The controllers are connected to each other through an HA interconnect. This allows one node to
serve data that resides on the disks of its failed partner node. Each node continually monitors its
partner, mirroring the data for each other's nonvolatile memory (NVRAM or NVMEM). The
interconnect is internal and requires no external cabling if both controllers are in the same chassis.
Takeover is the process in which a node takes over the storage of its partner. Giveback is the process
in which that storage is returned to the partner. Both processes can be initiated manually or
configured for automatic initiation.

Enabling HA mode capability and controller failover


The HA license is no longer required in Data ONTAP 8.2. You must manually configure each node
to enable or disable HA pair (high-availability) mode capability and controller failover.
Steps

1. Enter the following command on each of the node consoles to enable HA mode capability:
options cf.mode ha

2. Enter the following command on each of the node consoles to reboot the nodes:
reboot

3. Enter the following command on either of the node consoles to enable controller failover:
cf enable

4. Verify that controller failover is enabled by entering the following command on each node
console:
cf status

The system displays the following output if controller failover is enabled:


Controller Failover enabled, Node2 is up.

When takeovers occur


Takeovers can be initiated manually or occur automatically when a failover event happens,
depending on how you configure the HA pair. In some cases, takeovers occur automatically,
regardless of configuration.
Takeovers can occur under the following conditions:
A takeover is manually initiated.
A node is in an HA pair with the default configuration for immediate takeover on panic, and
that node undergoes a software or system failure that leads to a panic.
By default, the node automatically performs a giveback, returning the partner to normal
operation after the partner has recovered from the panic and booted up.
A node that is in an HA pair undergoes a system failure (for example, a loss of power) and
cannot reboot.
Note: If the storage for a node also loses power at the same time, a standard takeover is not
possible. For MetroCluster configurations, you can initiate a forced takeover in this situation.


One or more network interfaces that are configured to support failover become unavailable.
A node does not receive heartbeat messages from its partner.
This could happen if the partner experienced a hardware or software failure that did not result in
a panic but still prevented it from functioning correctly.
You halt one of the nodes without using the -f parameter.
You reboot one of the nodes without using the -f parameter.
Hardware-assisted takeover is enabled and triggers a takeover when the remote management
device (RLM or Service Processor) detects failure of the partner node.

How hardware-assisted takeover speeds up takeover


Hardware-assisted takeover speeds up the takeover process by using a node's remote
management device (SP or RLM) to detect failures and quickly initiate the takeover rather than
waiting for Data ONTAP to recognize that the partner's heartbeat has stopped.
Without hardware-assisted takeover, if a failure occurs, the partner waits until it notices that the
node is no longer giving a heartbeat, confirms the loss of heartbeat, and then initiates the
takeover.
The hardware-assisted takeover feature uses the following process to take advantage of the
remote management device and avoid that wait:
1. The remote management device monitors the local system for certain types of failures.
2. If a failure is detected, the remote management device immediately sends an alert to the
partner node.
3. Upon receiving the alert, the partner initiates takeover.
The hardware-assisted takeover option (cf.hw_assist.enable) is enabled by default.
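The speedup can be illustrated with a timing sketch. The numbers below are hypothetical example values, not measured latencies; only the 15-second heartbeat default comes from the text.

```python
# Conceptual sketch: with hardware-assisted takeover, the SP/RLM alerts the
# partner immediately on failure, instead of the partner waiting out the
# heartbeat-loss detection window before initiating takeover.

HEARTBEAT_DETECTION_SECONDS = 15  # cf.takeover.detection.seconds default
ALERT_DELIVERY_SECONDS = 2        # hypothetical SP/RLM alert latency

def time_to_takeover(hw_assist_enabled):
    """Seconds until the partner initiates takeover after a failure."""
    if hw_assist_enabled:
        return ALERT_DELIVERY_SECONDS   # partner alerted right away
    return HEARTBEAT_DETECTION_SECONDS  # wait out the heartbeat window

print(time_to_takeover(True))   # 2
print(time_to_takeover(False))  # 15
```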

What happens during takeover


When a takeover occurs, the unimpaired partner node takes over the functions and disk drives of
the failed node by creating an emulated storage system.

The emulated system performs the following tasks:


Assumes the identity of the failed node
Accesses the failed node's disks, array LUNs, or both, and serves its data to clients
The partner node maintains its own identity and its own primary functions, but also handles the
added functionality of the failed node through the emulated node.
Note: When a takeover occurs, existing CIFS sessions are terminated. A graceful shutdown of
the CIFS sessions is not possible, and some data loss could occur for CIFS users.
If the node doing the takeover panics
If the node that is performing the takeover panics within 60 seconds of initiating takeover, the
following events occur:
The node that panicked reboots.
After it reboots, the node performs self-recovery operations and is no longer in takeover mode.
Failover is disabled.

What happens during giveback


The local node returns ownership of the aggregates and volumes to the partner node after any
issues on the partner node are resolved or maintenance is complete. In addition, the local node
returns ownership when the partner node has booted up and giveback is initiated either manually
or automatically.
When the failed node is functioning again, the following events can occur:
You issue a cf giveback command that terminates the emulated node on the partner.
The failed node resumes normal operation, serving its own data.
The HA pair resumes normal operation, with each node ready to take over for its partner if the
partner fails.

Planning your HA pair configuration


Best practices for HA pairs
To ensure that your HA pair is robust and operational, you need to be familiar with configuration
best practices.
Make sure that each power supply unit in the storage system is on a different power grid so that
a single power outage does not affect all power supply units.
Use interface groups (virtual interfaces) to provide redundancy and improve availability of
network communication.
Maintain consistent configuration between the two nodes.

An inconsistent configuration is often the cause of failover problems.


Test the failover capability routinely (for example, during planned maintenance) to ensure proper
configuration.
Make sure that each node has sufficient resources to adequately support the workload of both
nodes during takeover mode.
Use the Config Advisor tool to help ensure that failovers are successful.

If your system supports remote management (through an RLM or Service Processor), make
sure that you configure it properly.
Follow recommended limits for FlexVol volumes, dense volumes, Snapshot copies, and LUNs to
reduce the takeover or giveback time.
When adding traditional or FlexVol volumes to an HA pair, consider testing the takeover and
giveback times to ensure that they fall within your requirements.
For systems using disks, check for failed disks regularly and remove them as soon as possible, as
described in the Data ONTAP Storage Management Guide for 7-Mode. Failed disks can extend
the duration of takeover operations or prevent giveback operations.
Multipath HA is required on all HA pairs except for some FAS22xx system configurations, which
use single-path HA and lack the redundant standby connections.

Setup requirements and restrictions for standard HA pairs


You must follow certain requirements and restrictions when setting up a new standard HA pair.
These requirements help you ensure the data availability benefits of the HA pair design.
The following list specifies the requirements and restrictions you should be aware of when setting up
a new standard HA pair:
Architecture compatibility
Both nodes must have the same system model and be running the same Data ONTAP software
and system firmware versions. See the Data ONTAP Release Notes for 7-Mode for the list of
supported systems.
Nonvolatile memory (NVRAM or NVMEM) size and version compatibility
The size and version of the system's nonvolatile memory must be identical on both nodes in an
HA pair.
Storage capacity
The number of disks or array LUNs must not exceed the maximum configuration capacity. If
your system uses both native disks and array LUNs, the combined total of disks and array LUNs
cannot exceed the maximum configuration capacity. In addition, the total storage attached to each
node must not exceed the capacity for a single node.
Disks and disk shelf compatibility
FC, SATA, and SAS storage are supported in standard HA pairs.
FC disks cannot be mixed on the same loop as SATA or SAS disks.
Different connection types cannot be combined in the same loop or stack.
Different types of storage can be used on separate stacks or loops on the same node. You can
also dedicate a node to one type of storage and the partner node to a different type.
Multipath HA is required on all HA pairs except for some FAS22xx system configurations,

which use single-path HA and lack the redundant standby connections.


Mailbox disks or array LUNs on the root volume
Two disks are required if the root volume is on a disk shelf.
One array LUN is required if the root volume is on a storage array.
The mailbox disks and LUNs are used for the following tasks:
Maintaining consistency between the nodes in the HA pair
Continually checking whether the other node is running or whether it has performed a
takeover
Storing configuration information that is not specific to any particular node
HA interconnect adapters and cables must be installed unless the system has two controllers in
the chassis and an internal interconnect.
Nodes must be attached to the same network and the Network Interface Cards (NICs) must be
configured correctly.
The same system software, such as Common Internet File System (CIFS) or Network File System
(NFS), must be licensed and enabled on both nodes.
Note: If a takeover occurs, the takeover node can provide only the functionality for the licenses
installed on it. If the takeover node does not have a license that was being used by the partner
node to serve data, your HA pair loses functionality after a takeover.
For an HA pair using array LUNs, both nodes in the pair must be able to detect the same array
LUNs.
However, only the node that is the configured owner of a LUN has read-and-write access to that
LUN. During takeover operations, the emulated storage system maintains read-and-write access
to the LUN.

What multipath HA for HA pairs is


Multipath HA provides redundancy for the path from each controller to every disk shelf in the
configuration. It is the preferred method for cabling a storage system. An HA pair without multipath
HA has only one path from each controller to every disk, but an HA pair with multipath HA has two
paths from each controller to each disk, regardless of which node owns the disk.


The following diagram shows the connections between the controllers and the disk shelves for an
example HA pair using multipath HA. The redundant primary connections and the redundant standby
connections are the additional connections required for multipath HA for HA pairs.

How the connection types are used


A multipath HA configuration uses primary, redundant and standby connections to ensure continued
service in the event of the failure of an individual connection.
The following table outlines the connection types used for multipath HA for HA pairs, and how the
connections are used:

Advantages of multipath HA
Multipath connections in an HA pair reduce single-points-of-failure.
By providing two paths from each controller to every disk shelf, multipath HA provides the
following advantages:


The loss of a disk shelf module, connection, or host bus adapter (HBA) does not require a
failover.
The same storage system can continue to access the data using the redundant path.
The loss of a single disk shelf module, connection, or HBA does not prevent a successful
failover.
The takeover node can access its partner's disks using the redundant path.
You can replace modules without having to initiate a failover.
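The advantage of the second path can be sketched as a reachability check. This is a conceptual illustration; the path names are made up for the example.

```python
# Conceptual sketch: with multipath HA, each controller has two independent
# paths to every disk shelf, so the loss of one shelf module, connection, or
# HBA leaves the shelf reachable and does not force a failover.

def shelf_reachable(paths, failed):
    """paths: set of path names; failed: set of failed path names.
    The shelf stays reachable while at least one path survives."""
    return len(paths - failed) > 0

single_path = {"primary"}
multipath   = {"primary", "redundant"}

print(shelf_reachable(single_path, {"primary"}))  # False: access lost
print(shelf_reachable(multipath, {"primary"}))    # True: redundant path used
```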

Requirements for hardware-assisted takeover


The hardware-assisted takeover feature is available only on systems with an RLM or SP module
configured for remote management. Remote management provides remote platform management
capabilities, including remote access, monitoring, troubleshooting, logging, and alerting features.
Although a system with remote management on both nodes provides hardware-assisted takeover for
both, hardware-assisted takeover is also supported on HA pairs in which only one of the two systems
has remote management configured. Remote management does not have to be configured on both
nodes in the HA pair. Remote management can detect failures on the system in which it is installed
and provide faster takeover times if a failure occurs on the system with remote management.

Managing takeover and giveback


An HA pair allows one partner to take over the storage of the other, and return the storage using the
giveback operation. Management of the nodes in the HA pair differs depending on whether one
partner has taken over the other, and the takeover and giveback operations themselves have different
options.
This information applies to all HA pairs regardless of disk shelf type.

Monitoring an HA pair in normal mode


You can display information about the status and configuration of an HA pair in normal mode (when
neither node has taken over the other).

Monitoring HA pair status


You can use commands on the local node to determine whether the controller failover feature is
enabled and whether the other node in the HA pair is up.
Step

1. Enter the following command:


cf status

The following example shows that the HA pair is enabled and the interconnect is up and working
correctly:
node1> cf status
Controller Failover enabled, node2 is up.
RDMA Interconnect is up (Link 0 up).

If the output shows that one link is down, the HA pair is degraded and you must configure the
link so that it is up while the other link is still active.
Note: Depending on the storage system model, the output might display either RDMA
Interconnect or VIA Interconnect in the last line.
Note: Data ONTAP can disable controller failover if a software or hardware problem exists
that prevents a successful takeover. In this case, the message returned from the cf status
command describes the reason failover is disabled.

Description of HA pair status messages


The cf status command displays information about the status of the HA pair.


Monitoring the hardware-assisted takeover feature


You can check and test the hardware-assisted takeover configuration using the cf hw_assist
command. You can also use the command to review statistics relating to hardware-assisted
takeover.
Checking the hardware-assisted takeover status of the local and partner node
You can check the status of the hardware-assisted takeover configuration with the cf
hw_assist status command. It shows the current status for the local and partner nodes.


Testing the hardware-assisted takeover configuration


You can test the hardware-assisted takeover configuration with the cf hw_assist test command.
About this task
The cf hw_assist test command sends a test alert to the partner. If the alert is received, the
partner sends back an acknowledgment, and a message indicating the successful receipt of the test
alert is displayed on the console.
Step

1. Enter the following command to test the hardware-assisted takeover configuration:


cf hw_assist test
After you finish

Depending on the message received from the cf hw_assist test command, you might need to
reconfigure options so that the HA pair and the remote management card are operating.

Checking hardware-assisted takeover statistics


You can display statistics about hardware-assisted takeovers to determine how many alert events
of each type have been received from the partner.
Step

1. Enter the following command to display or clear the hardware-assisted takeover statistics,
respectively:
cf hw_assist stats
cf hw_assist stats clear

Displaying the partner's name


You can display the name of the other node with the cf partner command.
Step

1. Enter the following command:


cf partner
Note: If the node does not yet know the name of its partner because the HA pair is new, this
command returns partner.

Displaying disk and array LUN information on an HA pair


To find out about the disks, array LUNs, or both on both the local and partner node, you can use
the sysconfig and aggr status commands, which display information about both nodes.

Configuring automatic takeover


You can control when automatic takeovers occur by setting the appropriate options.

Reasons that automatic takeover occurs


You can set options to control whether automatic takeovers occur due to different system errors.
In some cases, automatic takeover occurs by default unless you disable the option, and in some
cases automatic takeover cannot be prevented.
Takeovers can happen for several reasons. Some system errors cause a takeover; for example,
when a system in an HA pair loses power, it automatically fails over to the other node.
However, for some system errors, a takeover is optional, depending on how you set up your HA
pair. The following table outlines which system errors can cause a takeover to occur, and whether
you can configure the HA pair for that error.


Commands for performing a manual takeover


You must know the commands you can use when initiating a takeover. You can initiate a
takeover on a node in an HA pair to perform maintenance on that node while still serving the
data on its disks, array LUNs, or both to users.


Halting a node without takeover


You can halt the node and prevent its partner from taking over.
About this task

You might need to perform maintenance on both the storage system and its disks and want to
avoid an attempt by the partner node to write to those disks.
Step

1. Enter the following command:


halt -f

Rebooting a node without takeover


You can reboot the node and prevent its partner from taking over, overriding the
cf.takeover.on_reboot option.
Step

1. Enter the following command:


reboot -f

Enabling and disabling takeover


You might want to use the cf disable command to disable takeover if you are doing
maintenance that typically causes a takeover. You can reenable takeover with the cf enable
command after you finish maintenance.

Step

1. Enter the following command:


cf enable|disable
Use cf enable to enable takeover or cf disable to disable takeover.
Note: You can enable or disable takeover from either node.

Enabling and disabling takeover on reboot


The takeover on reboot option enables you to control whether an automatic takeover occurs when a
node reboots. This automatic takeover, and the automatic giveback that follows after the reboot is
complete, can reduce the outage during which the storage belonging to the rebooting system is
unavailable.
About this task

If this option is enabled and a takeover occurs because of a reboot, then an automatic giveback is
performed after the partner has booted. This giveback occurs even if the
cf.giveback.auto.enable option is set to off. However, if a node takes over its partner due to a
reboot and that node itself reboots before it can execute a giveback, it performs automatic giveback
only if cf.giveback.auto.enable is set to on.
If cf.takeover.on_reboot is off and a node is rebooted, the partner does not take over
immediately. However, the partner can take over later if the node takes more than 180 seconds to boot.
Note: If the reboot -f command is used, then the partner does not take over under any
circumstances, even if the reboot timer expires.


Step

1. Enter the following command:


options cf.takeover.on_reboot on
The default is on, unless FC or iSCSI is licensed, in which case the default is off.
Note: If you enter this command on one node, the value applies to both nodes.

This option is persistent across reboots.
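The giveback rules above can be condensed into a small decision sketch. This is a hypothetical function illustrating the documented behavior, not Data ONTAP source code.

```python
# Conceptual sketch: a reboot-triggered takeover normally ends in automatic
# giveback even when cf.giveback.auto.enable is off -- unless the takeover
# node itself rebooted before it could give back, in which case automatic
# giveback happens only if cf.giveback.auto.enable is on.

def auto_giveback(takeover_was_due_to_reboot, survivor_rebooted_mid_takeover,
                  giveback_auto_enable):
    if takeover_was_due_to_reboot and not survivor_rebooted_mid_takeover:
        return True
    return giveback_auto_enable

print(auto_giveback(True,  False, False))  # True: reboot takeover, auto giveback
print(auto_giveback(True,  True,  False))  # False: survivor rebooted, option off
print(auto_giveback(True,  True,  True))   # True: option on
```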

Enabling and disabling automatic takeover of a panicked partner


Data ONTAP is configured by default to initiate a takeover immediately if the partner node panics.
This shortens the time between the initial failure and the time that service is fully restored because
the takeover can be quicker than the recovery from the panic, although the subsequent giveback
causes another brief outage.
About this task

If you enter this command on one node, the value applies to both nodes.
The setting of this option is persistent across reboots.
By default, Data ONTAP will initiate an automatic giveback after a takeover on panic.
The cf.giveback.auto.after.panic.takeover option can be used to disable this
automatic giveback.
Steps

1. Verify that controller takeover is enabled by entering the following command:


cf enable

2. Enable or disable automatic takeover on panic by entering the following command:
options cf.takeover.on_panic {on|off}
on enables immediate takeover of a panicked node. This is the default value.
off disables immediate takeover of a panicked node. If you disable this option, normal takeover

procedures apply: if a node panics and stays down without sending messages to its partner for 15
seconds, the partner then automatically takes over the failed node.

Specifying the time period before takeover


You can specify how long (in seconds) a partner in an HA pair can be unresponsive before the other
partner takes over.
About this task

The two partners do not need to have the same value for this option; you can have one partner that
takes over more quickly than the other.
Note: If your HA pair is failing over because one of the nodes is too busy to respond to its partner,
increase the value of the cf.takeover.detection.seconds option on the partner.
Step

1. Enter the following command:


options cf.takeover.detection.seconds number_of_seconds
The valid values for number_of_seconds are 10 through 180; the default is 15.
Note: If the specified time is less than 15 seconds, unnecessary takeovers can occur, and a core
might not be generated for some system panics. Use caution when assigning a takeover time of
less than 15 seconds.
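The constraints quoted above (valid range 10 through 180, default 15, caution below 15) can be expressed as a small validator. This is purely illustrative; the function name is invented:

```python
# Hedged sketch of the cf.takeover.detection.seconds constraints:
# valid range 10-180, default 15, warning for risky sub-15 values.

def set_detection_seconds(value=15):
    """Validate a takeover detection time and flag risky settings."""
    if not 10 <= value <= 180:
        raise ValueError("valid values for number_of_seconds are 10 through 180")
    warning = None
    if value < 15:
        warning = ("values under 15 s can cause unnecessary takeovers and "
                   "suppress core generation on some panics")
    return value, warning

print(set_detection_seconds())        # (15, None): the default is safe
print(set_detection_seconds(12)[1] is not None)   # True: warning issued
```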

Enabling or disabling negotiated failover for a network interface


You can enable or disable negotiated failover for a network interface to trigger automatic takeover if
the interface experiences a persistent failure. You can use the nfo option of the ifconfig command
to enable or disable negotiated failover.
About this task

You can specify the nfo option for an interface group. However, you cannot specify the nfo option
for any underlying physical interface of the interface group.
Steps

1. To enable takeover during interface failure, enter the following command:


options cf.takeover.on_network_interface_failure on

2. To enable or disable negotiated failover, enter the following command:


ifconfig interface_name {nfo|-nfo}
interface_name is the name of the network interface.
nfo enables negotiated failover.
-nfo disables negotiated failover.
Example

To enable negotiated failover on the interface e8 of an HA configuration, enter the following


command:
ifconfig e8 nfo
Note: The nfo option is persistent across reboots after it is enabled on an interface.


Managing an HA pair in takeover mode


You manage an HA pair in takeover mode by performing a number of management actions.

Determining why takeover occurred


You can use the cf status command to determine why a takeover occurred.
Step

1. At the takeover prompt, enter the following command:


cf status
Result

This command can display the following information:


Whether controller failover is enabled or disabled
Whether a takeover is imminent due to a negotiated failover
Whether a takeover occurred, and the reason for the takeover

Statistics in takeover mode


This section explains how system statistics differ in takeover mode.
In takeover mode, statistics for some commands differ from the statistics in normal mode in the
following ways:
Each display reflects the sum of operations that take place on the takeover node plus the
operations on the failed node.
The display does not differentiate between the operations on the takeover node and the
operations on the failed node.
The statistics displayed by each of these commands are cumulative.
After giving back the failed partner's resources, the takeover node does not subtract the
statistics it performed for the failed node in takeover mode.
The giveback does not reset (zero out) the statistics.
To get accurate statistics from a command after a giveback, you can reset the statistics as
described in the man page for the command you are using.
Note: You can have different settings on each node for SNMP options, but any statistics gathered

while a node was taken over do not distinguish between nodes.
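The statistics behavior above can be modeled with a toy counter. This is not ONTAP internals, only a sketch of the rules stated in this section:

```python
# Toy model of per-command statistics in takeover mode: displays show
# the *sum* of takeover-node and failed-node operations, and giveback
# does not zero the counters.

class TakeoverStats:
    def __init__(self):
        self.ops = 0

    def record(self, local_ops, partner_ops=0):
        # Operations for both nodes are folded into one cumulative total;
        # the display does not distinguish which node performed them.
        self.ops += local_ops + partner_ops

    def giveback(self):
        # Giveback does not reset (zero out) the statistics.
        pass

stats = TakeoverStats()
stats.record(10, partner_ops=5)   # takeover mode: both nodes counted
stats.giveback()
stats.record(3)                   # normal mode again
print(stats.ops)                  # 18: partner ops are never subtracted
```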

Giveback operations
Giveback can be implemented and configured in a number of different ways. It can also be
configured to occur automatically.

About manual giveback


You can perform a normal giveback, a giveback in which you terminate processes on the partner
node, or a forced giveback.
Option for shortening giveback time
You can shorten the client service outage during giveback by using the
cf.giveback.check.partner option. You should always set this option to on.


Removing failed disks prior to attempting giveback


For taken-over systems that use disks, you must remove the failed disk or disks prior to attempting to
implement giveback.
Step

1. Remove the failed disks, as described in the Data ONTAP Storage Management Guide for 7-Mode.
After you finish

When all failed disks are removed or replaced, proceed with the giveback operation.
Initiating normal giveback
You can return control to a taken-over partner with the cf giveback command.
Forcing giveback
The takeover node might detect an error condition on the failed node that typically prevents a
complete giveback, such as data not being flushed from NVRAM to the failed node's disks. In that
case, you can force a giveback if necessary.
About this task

You can use this procedure to force the takeover node to give back the resources of the failed node
even if the takeover node detects an error that typically prevents a complete giveback.
Note: The cf forcegiveback command should be used with caution because it can cause a loss
of data. If you cannot risk loss of data and are unable to complete the giveback, contact technical
support.
Steps

1. On the takeover node, enter the following command:


cf giveback -f
The -f parameter allows giveback to proceed as long as it would not result in data corruption or
an error on the storage system.


2. If giveback is still not successful, and if you can risk possible loss of data, enter the following
command on the takeover node:
cf forcegiveback
Attention: Use cf forcegiveback only when you cannot get cf giveback -f to
succeed.
When you use this command, you risk losing any data committed to NVRAM but not to disk.
If a cifs terminate command is running, allow it to finish before forcing a giveback.
If giveback is interrupted
If the takeover node experiences a failure or a power outage during the giveback process, that
process stops and the takeover node returns to takeover mode until the failure is repaired or the
power is restored.
However, this depends upon the stage of giveback in which the failure occurred. If the node
encountered failure or a power outage during partial giveback state (after it has given back the
root aggregate), it will not return to takeover mode. Instead, the node returns to partial-giveback
mode. If this occurs, complete the process by repeating the giveback operation.
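The interrupted-giveback behavior above depends only on whether the root aggregate had already been returned, which a short sketch makes explicit (illustrative names, not ONTAP code):

```python
# Sketch of the interrupted-giveback rule described above: the landing
# state depends on whether the root aggregate had already been given
# back when the failure or power outage occurred.

def state_after_interrupted_giveback(root_aggregate_given_back):
    """Mode the takeover node returns to if giveback is interrupted."""
    if root_aggregate_given_back:
        return "partial-giveback"   # repeat the giveback to finish
    return "takeover"               # remains until the failure is repaired

print(state_after_interrupted_giveback(False))  # takeover
print(state_after_interrupted_giveback(True))   # partial-giveback
```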


Configuring giveback
You can configure how giveback occurs, setting different Data ONTAP options to improve the speed
and timing of giveback.

Configuring automatic giveback


You can enable automatic giveback by setting the cf.giveback.auto.enable option.
Step

1. Enter the following command to enable automatic giveback:


options cf.giveback.auto.enable on
The on value enables automatic giveback. The off value disables automatic giveback. This
option is off by default.

Adjusting the giveback delay time for automatic giveback


By default, there is a 600-second minimum time that a node stays in the takeover state before
performing an automatic giveback. This delay reduces the overall outage that can occur while the
taken-over partner reboots. Instead of a single longer outage, there are two brief outages (the first
when the partner is taken over, the second when giveback occurs). This option affects all types of
automatic giveback but does not affect manual giveback.
Step

1. Enter the following command:


options cf.giveback.auto.delay.seconds number_of_seconds
The valid values for number_of_seconds are 0 to 600. The default is 600.
Attention: If cf.giveback.auto.delay.seconds is set to 0, the combined outage during
takeover and giveback results in a long total client outage.


Setting giveback delay time for CIFS clients
You can specify the number of minutes to delay an automatic giveback before the system
terminates CIFS clients that have open files.
About this task

During the delay, the system periodically sends notices to the affected clients. If you specify 0,
CIFS clients are terminated immediately.
This option is used only if automatic giveback is enabled.
Step

1. Enter the following command:


options cf.giveback.auto.cifs.terminate.minutes minutes
Valid values for minutes are 0 through 999. The default is 5 minutes.
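The three automatic-giveback options covered in this section can be gathered into one hypothetical validator. The ranges and defaults come from the text above; the function itself is only an illustration:

```python
# Illustrative validator for the automatic-giveback options discussed
# above: cf.giveback.auto.enable (off by default),
# cf.giveback.auto.delay.seconds (0-600, default 600), and
# cf.giveback.auto.cifs.terminate.minutes (0-999, default 5).

def auto_giveback_options(enable=False, delay_seconds=600,
                          cifs_terminate_minutes=5):
    if not 0 <= delay_seconds <= 600:
        raise ValueError("cf.giveback.auto.delay.seconds must be 0-600")
    if not 0 <= cifs_terminate_minutes <= 999:
        raise ValueError("cf.giveback.auto.cifs.terminate.minutes must be 0-999")
    return {
        "cf.giveback.auto.enable": "on" if enable else "off",
        "cf.giveback.auto.delay.seconds": delay_seconds,
        "cf.giveback.auto.cifs.terminate.minutes": cifs_terminate_minutes,
    }

print(auto_giveback_options()["cf.giveback.auto.enable"])  # off (default)
```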


HIGH-AVAILABILITY CONFIGURATION
In the past, NetApp used the term active-active to describe the high-availability (HA) controller
failover configuration. Two controller heads are configured as a pair, with each node providing failover
support for its partner.
CONFIGURATION

High-availability is a licensed feature that must be enabled before it can be configured and used. The
high-availability feature must be licensed on both HA nodes.
License the first node.
license add <licnum>
License the second node.
license add <licnum>
NOTE: You must then reboot and start the controller failover feature.
After the high-availability feature is enabled, you can unlicense it only when the HA pair is in a normal
state and the controller failover services are manually disabled.
Before you enable the high-availability feature, you must configure various settings. For example, the HA
pair interconnect cable must be attached, both controllers must have Fibre Channel connections to all
expansion drawers, and both controllers must be provided access to the same IP subnets.
The high-availability feature activates numerous high availability capabilities, such as NVRAM
mirroring, which enables the controllers to provide failover support for each other.
For example, if a controller failure occurs:

The surviving node spawns a virtual instance of the failed node
The virtual node accesses its mirrored NVRAM to complete any interrupted write
The local network interface assumes the IP address of both the local and partner interfaces (for
Ethernet traffic)
The local FC interfaces retain their original WWPN addresses, and the host-based MPIO drivers direct
all FC traffic via the interfaces (assuming that single-image cfmode is being used)
The process of removing a high-availability configuration is as follows:
1. Disable the controller failover feature (cf disable).
2. Delete the controller failover license (license delete ).
3. Remove the partner's network entries from the /etc/rc file.
4. Halt, and make sure the partner-sysid is blank.
5. Power down and remove or relocate the controller failover interconnect card.
6. Repeat the process on the other controller.
ADMINISTRATION

In a high-availability configuration, all disks are visible to both controllers. Before a disk can be used in
an aggregate or a spare, it must be assigned to one or the other controller. This process is known as
software disk assignment. If a disk is not assigned to a controller (in other words, if it is listed as not
owned), then it cannot be used by either controller for any purpose.
Use the following commands to manage disk ownership:
Assign disk ownership
disk assign
List disk ownership (several methods)
disk show -v
storage show disk
sysconfig -r
NOTE: Earlier versions of Data ONTAP supported hardware disk assignment, where ownership was
determined by the Fibre Channel cabling topology. This mode is not supported on any current-generation
controller.
Generally, administration of a high-availability configuration and administration of two non-clustered
controllers are identical. The clustered controllers are managed separately, although some configuration
settings must be synchronized between the two controllers. One of the features that you must master is the
process of HA pair failover and failback.
For example, after a failed controller is rebooted and ready to assume its old identity and workload, it
displays a "waiting for giveback" or "waiting for mb giveback" message. At this point, the administrator
enters the cf giveback command on the operational controller to return the failed controller back to
the normal state.
NOTE: For more information about controller failover management, refer to the product manuals.

38
PERFORMANCE

Optimum performance is usually achieved when the controllers in a high-availability configuration share
the client workload evenly. An even distribution of the workload is usually attributable to good solution
planning and to automatic load balancing in the host-based MPIO drivers (for FC traffic).
In an FC SAN environment, ensure that the host-based multipathing support is correctly configured.
Where appropriate, use Asymmetric Logical Unit Access (ALUA) support.
In most other ways, the performance concerns of a high-availability configuration and of a non-clustered
configuration are identical.
SECURITY

In almost all aspects of security, a high-availability configuration and a non-clustered configuration are
identical.
TROUBLESHOOTING

In a high-availability configuration, both controllers require connectivity to all of the disk expansion
shelves. It is not possible to have a shelf connected to one controller and not to the other controller. If a
controller loses access to one of the disk shelves, a negotiated (clean, but automatic) failover is triggered.
It is recommended that multipath HA cabling be used for the disk-expansion shelf connections. The
cabling prevents unnecessary controller failover for non-critical reasons, such as SFP failure, loop breaks,
ESH module failure, and so on.

Ways to verify the configuration for HA pairs


There are two ways you can check your high-availability configuration before placing the pair
online: running the Config Advisor tool or using the command-line interface. You also need to
review the requirements for configuring your HA pair.
When you configure an HA pair, the following configuration information needs to be the same on
both systems:

Parameters

Network interfaces

Configuration files

Licenses and option settings

You must set the cf.mode option to ha for HA pairs.
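The consistency check implied by the list above can be sketched as a comparison of the settings that must match on both nodes. The dictionary keys here are illustrative, not an ONTAP data structure:

```python
# Minimal sketch of an HA-pair configuration consistency check:
# report which of the areas listed above differ between the two nodes.

def config_mismatches(node_a, node_b):
    keys = ("parameters", "network_interfaces", "config_files",
            "licenses_and_options")
    return [k for k in keys if node_a.get(k) != node_b.get(k)]

node_a = {"parameters": {"cf.mode": "ha"}, "licenses_and_options": {"cf": True}}
node_b = {"parameters": {"cf.mode": "ha"}, "licenses_and_options": {"cf": False}}
print(config_mismatches(node_a, node_b))   # ['licenses_and_options']
```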


HA cables (c0a, c0b): the 10G ports (c0x) on FAS3200 systems are reserved for internal use by Data
ONTAP as a cluster interconnect and are not user configurable. You cannot use these ports for your
own data. Boot into Maintenance mode and use ha-config show to check the current HA settings.


Checking the hardware-assisted takeover status of the local and partner node
You can check the status of the hardware-assisted takeover configuration with the
cf hw_assist status command. It shows the current status for the local and partner nodes.
Example of hardware-assisted takeover status
The following example shows output from the cf hw_assist status command:
Local Node Status ha1
Active: Monitoring alerts from partner (ha2)
port 4004 IP address 172.27.1.14
Partner Node Status ha2
Active: Monitoring alerts from partner (ha1)
port 4005 IP address 172.27.1.15

Verifying and setting the HA state on the controller modules and chassis
For systems that use the HA state value, the value must be consistent in all components in the HA
pair. You can use the Maintenance mode ha-config command to verify and, if necessary, set the
HA state.
About this task
The ha-config command applies only to the local controller module and, in the case of a dual-chassis
HA pair, the local chassis. To ensure consistent HA state information throughout the system,
you must also run these commands on the partner controller module and chassis, if necessary.
The HA state is recorded in the hardware PROM in the chassis and in the controller module. It must
be consistent across all components of the system.
Steps

1. Reboot the current controller module and press Ctrl-C when prompted to display the boot
menu.
2. At the boot menu, select the option for Maintenance mode boot.
3. After the system boots into Maintenance mode, enter the following command to display the
HA state of the local controller module and chassis:
ha-config show

The HA state should be ha for all components if the system is in an HA pair.


4. If necessary, enter the following command to set the HA state of the controller:
ha-config modify controller ha_state

5. If necessary, enter the following command to set the HA state of the chassis:
ha-config modify chassis ha_state

6. Exit Maintenance mode by entering the following command:


halt

7. Boot the system by entering the following command at the boot loader prompt:
boot_ontap


8. If necessary, repeat the preceding steps on the partner controller module.
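The consistency rule behind these steps is simple enough to state as code: every component must report the same HA state. A hedged sketch with invented names:

```python
# Illustrative check of the rule above: the HA state must be "ha" on
# every component (each controller module and, for dual-chassis HA
# pairs, each chassis) for the pair to be consistent.

def ha_state_consistent(component_states):
    """True only if every component reports the HA state 'ha'."""
    return all(state == "ha" for state in component_states.values())

print(ha_state_consistent({"controller_a": "ha", "controller_b": "ha",
                           "chassis": "ha"}))       # True
print(ha_state_consistent({"controller_a": "ha", "controller_b": "non-ha",
                           "chassis": "ha"}))       # False
```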

1) Is it possible to do a safe shutdown of a NetApp FAS2240 that includes two controllers and HA?
We have a NetApp FAS2240 with two controllers and HA. Next week we must
do a complete shutdown. How can we perform this safely?

A) Answer

In what order should elements be powered up and down during a power-cycle procedure?
Data ONTAP 7-Mode procedure to bring system down:
If the storage system is clustered, enter cf disable. Alternatively, follow the section
'For a clustered storage system...' given below Step 4.
Notify, disconnect and, if needed, shut down all of the connected CIFS/NFS clients.
If there are any hosts that have FCP or iSCSI-based LUNs, shut them down before
shutting down the storage system.
Terminate CIFS with the cifs terminate command.
Run the halt command at the storage system command-line interface. Allow the
storage system to terminate Data ONTAP and drop to the 'ok', 'cfe', or 'loader'
prompt.
For a clustered storage system if you did not run cf disable: run the halt -f
-t command on each of the partners.
Physically power down the head, then all the attached disk shelves as needed.
Physically unplug the cables from power supplies on the back of the storage
systems and shelves, to avoid any electrical issues when external power is restored.
Clustered Data ONTAP procedure to bring system down:
Notify, disconnect and, if needed, shut down all of the connected CIFS/NFS clients.
If there are any hosts that have FCP or iSCSI-based LUNs, shut them down before
shutting down the storage system.
If running an ONTAP version prior to 8.2, perform the following steps:
If on a 2-node cluster, run the following:
::> cluster ha modify -configured false
::> storage failover modify -node * -enabled false

If on a 4+-node cluster, run the following:

::> storage failover modify -node * -enabled false

Log in to all nodes, one at a time (preferably using serial console or RLM/SP) and
run:
::> halt local -inhibit-takeover true

The following will appear after running the halt command above. Type 'y' when
prompted if you want to continue:
(system node halt)


Warning: Rebooting or halting node "node-01" in an HA enabled cluster


with takeover inhibited may result in a data serving failure and
client disruption. To ensure continuity of service, do the following
before rebooting or halting the node: Disable cluster HA using the
command: "cluster ha modify -configured false".
To transfer epsilon to the partner node, use the following commands
(privilege:advanced):
cluster modify -epsilon false -node <local-node>
cluster modify -epsilon true -node <partner-node>
Do you want to continue? {y|n}: y

Every node might take several minutes to shut down. Each node should then reset
and return to the LOADER> prompt. If there is no console or RLM/SP access, you
should confirm the overall node down status before halting the final node, by
running the system node show command. After the last node is halted, you can
power down everything safely.
Physically power down the head, then all the attached disk shelves as needed.
Physically unplug the cables from power supplies on the back of the storage
systems and shelves, to avoid any electrical issues when external power is restored.
Procedure to bring the system back online:
Reconnect all the power cables if previously disconnected.
Power on core switches.
Physically power up all disk shelves first. Wait until 30 seconds after the last disk
shelf is powered on, then power on the storage system head so that all disks will be
available when they are required by Data ONTAP.
Verify the storage system is up, all services are running, and network connectivity is
present.
For 7-Mode HA pairs, if cluster was disabled using cf disable, enter cf enable and
monitor with cf status.
For clustered ONTAP systems, check cluster show and storage failover show to
confirm CFO/SFO is configured/enabled.
- If on version prior to 8.2 in which cluster ha and/or storage failover were disabled,
run the following commands:
::> cluster ha modify -configured true
::> storage failover modify -node * -enabled true

2)Volume and Exports in HA Controller setup

NFS
I have a question regarding volume and export paths with controllers in an HA setup.
For general purposes, this is a pair of FAS2240 controllers with OnTAP 8.1RC3
Let's suppose we have two volumes on each controller:
Controller A:
/vol/vol0 (root volume)
/vol/data0 (some generic data, NFS export)
Controller B:


/vol/vol0 (root volume)


/vol/data0 (some generic data, NFS export)
Assume both controllers are up and serving data. If we issue a "cf takeover" on one of the
controllers, what happens? My concern is the volume and export paths: they are the same
on each controller, but unique to each controller.
Is this setup invalid, or is ONTAP smart enough to know that the volume names and exports
are unique to the partner filer?

A) During takeover, a virtual copy of the partner filer is started with all its resources
(volumes, exports, IP addresses, etc.), so there is no confusion. Clients continue to
access the same exported resources via the same IP addresses as before.

3)Does "autosupport.doit" report info on both controllers in a HA


configuration?
When you issue an autosupport.doit command in an HA configuration, will the support message
contain info of both controllers?

A) You have to issue this command on both controllers separately.


4) Reconfiguring an HA pair into two stand-alone systems
To divide an HA pair so that the nodes become stand-alone systems without
redundancy, you must disable the HA software features and then remove the
hardware connections.
5) Ensuring uniform disk ownership within disk shelves and loops in the system

If a disk shelf or loop contains a mix of disks owned by Node A and Node B, you must use this
procedure to move the data and make disk ownership uniform within the disk shelf or loop.
Before you begin

You must ensure the following:


Disk ownership is uniform within all disk shelves and loops in the system
All the disks within a disk shelf or loop belong to a single node and pool
About this task
Note: It is a best practice to always assign all disks on the same loop to the same node and pool.
Steps

1. Use the following command to identify any disk shelves or loops that contain both disks
belonging to Node A and disks belonging to Node B:
disk show -v

2. Determine which node the disk shelf or loop with mixed ownership will be attached to when
the HA feature is unconfigured and record this information.


For example, if the majority of the disks in the loop belong to Node A, you probably want the
entire loop to belong to stand-alone Node A.
After you finish

Proceed to disable the HA software.

Configuring a node for non-HA (stand-alone) use


By default, storage controllers are configured for use in HA mode. To use a controller in stand-alone mode, you must disable the controller failover functionality and change the node to non-HA mode.
Before you begin

You must determine the current configuration of the storage controller because the controller
failover and HA mode states can vary. You can use the cf status command to determine the
current configuration.
You must also confirm that all loops and disk shelves in the system contain disks that belong to
only one of the two nodes you intend to isolate. If any disk shelves or loops contain a mix of
disks belonging to both nodes, you must move data.
About this task

When a storage controller is shipped from the factory or when Data ONTAP is reinstalled using
option four of the Data ONTAP boot menu (Clean configuration and initialize all
disks), HA mode is enabled by default, and the system's nonvolatile memory (NVRAM or
NVMEM) is split. If you plan to use the controller in standalone mode, you must configure the
node as non-HA. Reconfiguring the node as non-HA mode enables full use of the system's
nonvolatile memory.
Note: Configuring the node as standalone removes the availability benefits of the HA

configuration and creates a single point of failure.


Choices
If the cf status output displays Non-HA mode, then the node is configured for non-HA
mode and you are finished:


Example
node> cf status
Non-HA mode.
If the cf status output directs you to reboot, you must reboot the node to enable full use of the
system's nonvolatile memory:

Example
node> cf status
Non-HA mode. Reboot to use full NVRAM.

a) Reboot the node using the following command:


node> reboot

After the node reboots, you are finished.


If the cf status output displays Controller Failover enabled, you must disable both
controller failover and HA mode and then reboot the node to enable full use of the system's


nonvolatile memory:
Example
node> cf status
Controller Failover enabled

a) Disable controller failover using the following command:


node> cf disable

b) Set the mode to non-HA by using the following command:


node> options cf.mode non_ha
c) Open the /etc/rc file with a text editor and remove references to the partner node in the
ifconfig entries, as shown in the following example:
Example

Original entry:
ifconfig e0 199.9.204.254 partner 199.9.204.255

Edited entry:
ifconfig e0 199.9.204.254

d) Reboot the node by using the following command:


node> reboot

After the node reboots, you are finished.


If the cf status output displays Controller Failover disabled, then the HA mode is
still enabled, so you must set the HA mode to non-HA and reboot the node to enable full use of
the system's nonvolatile memory:
Example
node> cf status
Controller Failover disabled

a) Set the mode to non-HA by using the following command:


node> options cf.mode non_ha

b) Reboot the node by using the following command:


node> reboot

After the node reboots, you are finished.

Configuring network interfaces for HA pairs


Configuring network interfaces requires that you understand the available configurations for
takeover and that you configure different types of interfaces (shared, dedicated, and standby)
depending on your needs.

Understanding interfaces in an HA pair


You can configure three types of interfaces on nodes in an HA pair.
What the networking interfaces do
When a node in an HA pair fails, the surviving node must be able to assume the identity of the
failed node on the network. Networking interfaces allow individual nodes in the HA pair to
maintain communication with the network if the partner fails.
Note: You should always use multiple NICs with interface groups to improve networking

availability for both stand-alone storage systems and systems in an HA pair.


Shared, dedicated, and standby interfaces

Note: Most HA pair interfaces are configured as shared interfaces because they do not require an

extra NIC.

Configuring network interfaces for the HA pair


You must configure network interfaces so that if takeover occurs, interfaces on the operating node
take over the interfaces of the failed node and hosts can still reach data over the network.
Before you begin

Both nodes in the HA pair must have interfaces that access the same collection of networks and
subnetworks.
You must gather the following information before configuring the interfaces:
The IP address for both the local node and partner node.
The netmask for both the local node and partner node.
The MTU size for both the local node and partner node.
The MTU size must be the same on both the local and partner interface.
Note: You should always use multiple NICs with interface groups to improve networking
availability for both stand-alone storage systems and systems in an HA pair.

Configuring an IP address for a network interface


You can configure IP addresses for your network interface during system setup. To configure the
IP addresses later, you should use the ifconfig command. You can configure both IPv4 and
IPv6 addresses for a network interface.
About this task

Network configuration changes made by using the ifconfig command are not automatically
included in the /etc/rc file. To make the configuration changes persistent after reboots,
include the ifconfig command in the /etc/rc file.
When you configure an IP address, your storage system creates a network mask based on the


class of the address (Class A, B, C, or D) by default.
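The classful-netmask rule mentioned above follows the standard IPv4 address classes, and can be sketched as follows. This is illustrative code, not the ONTAP implementation; Class D (multicast) addresses have no default host netmask:

```python
# Sketch of the classful default-netmask derivation: the mask is chosen
# from the first octet of the IPv4 address (Classes A, B, C).

def default_classful_netmask(ip_address):
    first_octet = int(ip_address.split(".")[0])
    if first_octet < 128:        # Class A: 0.x - 127.x
        return "255.0.0.0"
    if first_octet < 192:        # Class B: 128.x - 191.x
        return "255.255.0.0"
    if first_octet < 224:        # Class C: 192.x - 223.x
        return "255.255.255.0"
    raise ValueError("Class D/E addresses have no default host netmask")

print(default_classful_netmask("10.1.2.3"))     # 255.0.0.0
print(default_classful_netmask("192.0.2.10"))   # 255.255.255.0
```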


Step

1. To configure an IP address for a network interface, enter the following command:


ifconfig interface_name IP_address
interface_name is the name of the network interface.
IP_address is the IP address that you want to assign to the network interface.
Example

To configure a quad-port Ethernet interface e3a to use the IPv4 address 192.0.2.10, enter the
following command:
ifconfig e3a 192.0.2.10

To configure a quad-port Ethernet interface e3a to use the IPv6 address
2001:0db8:35ab:0:8a2e:0:0370:85, enter the following command:
ifconfig e3a 2001:0db8:35ab:0:8a2e:0:0370:85

Configuring a partner interface in an HA pair


To prepare for a successful takeover in an HA configuration, you can map a network interface to
an IP address or to another network interface on the partner node. During a takeover, the network
interface on the surviving node assumes the identity of the partner interface.
Before you begin

When specifying the partner IP address, both the local network interface and the partner's
network interface must be attached to the same network segment or network switch.
About this task

If the network interface is an interface group, the partner interface must be denoted by an
interface name and not an IP address.
The partner interface can be an interface group or a physical network interface.
You cannot specify the underlying physical ports of an interface group in a partner
configuration.
If IPv6 addresses are to be taken over, you must specify the partner interface, and not an IP
address.
Address to address mapping is not supported for IPv6 addresses.
For the partner configuration to be persistent across reboots, you must include the ifconfig
command in the /etc/rc file.
For a successful takeover in both directions, you must repeat the partner configuration in
the /etc/rc files of each node.
When specifying the partner interface name, you can configure the interfaces symmetrically, for
example map interface e1 on one node to interface e1 on the partner node.
Though symmetrical configuration is not mandatory, it simplifies administration and
troubleshooting tasks.
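The partner-mapping constraints in this task can be summarized in a small validator, a sketch under the rules stated above with invented argument names:

```python
# Illustrative check of the partner-mapping rules: interface groups and
# IPv6 takeover require the partner *interface name*; address-to-address
# mapping is only valid for plain IPv4 physical interfaces.

def validate_partner_spec(partner_is_ifgrp, uses_ipv6, spec_is_interface_name):
    """Raise if an IP-address mapping is used where a name is required."""
    if (partner_is_ifgrp or uses_ipv6) and not spec_is_interface_name:
        raise ValueError("interface groups and IPv6 takeover require the "
                         "partner interface name, not an IP address")
    return True

# IPv4 physical interface mapped by address: allowed.
print(validate_partner_spec(False, False, False))   # True
# Interface group mapped by name: allowed.
print(validate_partner_spec(True, False, True))     # True
```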


Configuring partner addresses on different subnets


(MetroCluster configurations only)
On MetroCluster configurations, you can configure partner addresses on different subnets. To do
this, you must create a separate /etc/mcrc file and enable the cf.takeover.use_mcrc_file
option. When taking over its partner, the node uses the partner's /etc/mcrc file to configure
partner addresses locally. These addresses will reside on the local subnetwork.


Verifying the HA pair configuration


You can go to the NetApp Support Site and download the Config Advisor tool to check for
common configuration errors.
About this task

Config Advisor is a configuration validation and health check tool for NetApp systems. It can be
deployed at both secure sites and non-secure sites for data collection and system analysis.
Note: Support for Config Advisor is limited, and available only online.
Steps

1. Log in to the NetApp Support Site at support.netapp.com and go to Downloads > Utility
ToolChest.
2. Click Config Advisor (WireGauge renamed).
3. Follow the directions on the web page for downloading, installing, and running the utility.
4. After running Config Advisor, review the tool's output and follow the recommendations to
address any issues discovered.

Testing takeover and giveback


After you configure all aspects of your HA pair, you need to verify that it is operating as
expected in maintaining uninterrupted access to both nodes' storage during takeover and
giveback operations.
Throughout the takeover process, the local (or takeover) node should continue serving the data
normally provided by the partner node. During giveback, control and delivery of the partner's
storage should return transparently to the partner node.
Steps

1. Check the cabling on the HA interconnect cables to make sure that it is secure.
2. Verify that you can create and retrieve files on both nodes for each licensed protocol.
3. Enter the following command from the local node console:
cf takeover

See the man page for command details.
The local node takes over the partner node and gives the following output:
Failover monitor: takeover completed
4. Use the sysconfig -r command to ensure that the local (takeover) node can access its
partner's disks.
5. Give back the partner node's data service after it displays the Waiting for giveback message
by entering the following command:
cf giveback

The local node releases the partner node, which reboots and resumes normal operation. The
following message is displayed on the console when the process is complete:
giveback completed
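Put together, a successful test looks roughly like this at the console (the hostname, prompts,
and intermediate output are illustrative):

    filer1> cf takeover
    Failover monitor: takeover completed
    filer1(takeover)> sysconfig -r
    (partner aggregates and disks should be listed)
    filer1(takeover)> cf giveback
    giveback completed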


What are the common causes of High Availability 'Takeover Impossible' events?
Answer

Data ONTAP will not attempt a partner takeover when it can determine, prior to the takeover
attempt, that the takeover will fail. The resulting ASUP will automatically open a customer
support case with "TAKEOVER IMPOSSIBLE" in the symptom field. The case symptom text
will be of the form:
CLTFLT: Cluster Notification from (PARTNER DOWN, TAKEOVER IMPOSSIBLE) ERROR

Approximately 70% of NetApp FAS3000, FAS3100, and FAS6000 systems are deployed as High
Availability (HA) configurations. Proper configuration of HA systems requires installing all
necessary HA hardware, enabling cluster software licenses, setting HA-related options, and more.
In many instances, if the HA system is not configured properly and takeover by the partner
system is not possible, hourly alert messages will be posted to the console. The messages will be
of the form: "statd:ALERT Cluster is licensed but takeover of partner is disabled."
This article describes how to diagnose five common causes of takeover impossible events, and
the actions required to correct the issues found. The focus is on remote diagnosis from ASUP
logs, primarily the MESSAGES and CLUSTER-MONITOR logs.
Five common types of statd:ALERT messages are described below:

- Cluster is licensed but takeover of partner is disabled
- Cluster is licensed but takeover of partner is disabled due to reason: interconnect error
- Cluster is licensed but takeover of partner is disabled due to reason: partner mailbox disks not accessible or invalid
- Cluster is licensed but takeover of partner is disabled due to reason: CFO not licensed
- Cluster is licensed but takeover of partner is disabled due to reason: unsynchronized log
Cluster is licensed but takeover of partner is disabled


The ASUP MESSAGES log will have hourly messages of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled.

The most common reason that systems report this message is that takeover functionality has been
disabled manually: an operator has entered cf disable at the console command line. Entering
cf enable will re-enable takeover and clear the hourly ALERT message.
To confirm that takeover has been disabled by the operator, check the ASUP CLUSTER-MONITOR
log. The fifth entry in the log begins with "takeoverByPartner". If takeover has been manually
disabled, the entry will contain the text string "NVRAM_DOWN,CLUSTER_DISABLE".

Example:
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:02):
partner 'NetApp1' VIA Interconnect is up (link 0 up, link 1 up)
state UP, time 90788045660, event CHECK_FSM, elem ChkMbValid (12)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2041 <<< look here
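When the CLUSTER-MONITOR log confirms an operator disable, the fix is simply to re-enable
takeover. A console sketch (the hostname and exact status wording are illustrative and vary by
release):

    filer1> cf enable
    filer1> cf status
    Cluster enabled, filer2 is up.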

Cluster is licensed but takeover of partner is disabled due to reason: interconnect error
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : interconnect error

The interconnect link status is shown on the second line of the CLUSTER-MONITOR log. In the
examples below, the interconnect is not present or both links are down.
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:01):
partner 'NetApp1', Interconnect not present <<< look here

===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:02):
partner 'NetApp1', VIA Interconnect is down (link 0 down, link 1 down) <<< look here

Another common abnormal condition shows the "partner" as "unknown".
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:02):
partner 'unknown', VIA Interconnect is down (link 0 down, link 1 down) <<< look here

The corrective action required is to verify that the interconnect cables and links are connected
and active. When the partner is reported as 'unknown', verify that the partner filer/platform is
present and active. If no partner system is present, it is likely that the system was once part of
an HA pair and was improperly reconfigured as standalone. See the documentation (Removing an
active/active configuration) for more information about how to properly split a cluster and clear
the 'unknown' partner messages.
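A quick console check of the interconnect uses commands already shown in this article; a sketch
(the hostname and output wording are illustrative and vary by release):

    filer1> cf status
    filer2 is up.
    VIA Interconnect is up (link 0 up, link 1 up).
    filer1> ic stats error -v

After reseating the cables, cf status should report both links up on both nodes, and
ic stats error -v shows whether errors are still accumulating on the link.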
Cluster is licensed but takeover of partner is disabled due to reason: partner mailbox disks not accessible or invalid
The ASUP MESSAGES log will have hourly entries of the form:
[ statd:ALERT]: Cluster is licensed but takeover of partner is disabled due to
reason : partner mailbox disks not accessible or invalid

The status of the mailbox disks is shown approximately 15 lines from the top of the
CLUSTER-MONITOR log. A normal entry will show the disk paths for all of the mailbox disks; an
example is provided below for illustration. The disk identifiers (4a.17, 4a.29, 8b.34, and 8b.35
in the example) will vary depending on the system configuration.
mailbox disks:
Disk 4a.17 is a primary mailbox disk
Disk 4a.29 is a primary mailbox disk
Disk 8b.34 is a partner mailbox disk
Disk 8b.35 is a partner mailbox disk
Two common abnormal conditions:

1. No partner disk entries. Instead, the log contains 'No partner disks attached!'
mailbox disks:
Disk 8a.20 is a local mailbox disk
Disk 8a.19 is a local mailbox disk
No partner disks attached! <<< look here

2. Some partner disks shown with path as '?.?'.
mailbox disks:
Disk 4a.17 is a primary mailbox disk
Disk 4a.29 is a primary mailbox disk
Disk ?.? is a partner mailbox disk <<< look here
Disk ?.? is a partner mailbox disk <<< look here

To correct these fault conditions, first check that the partner system is present and active. Then
check the FC adapters in the filers/platforms and the shelf cabling to each of the mailbox disk
shelves.
If the problem continues, check whether 'partner-sysid' shows a correct partner sysid:
CFE> printenv
Variable Name        Value
-------------------- --------------------------------------------------
BOOT_CONSOLE         rlm0a
fcal-host-id         7
partner-sysid        0101183784

Then attempt the following steps on both HA controllers:
1. Disable clustering by typing cf disable.
2. Reboot the controller.
3. Press Ctrl-C during the boot sequence to go to the special boot menu.
4. Select option 5 to go into Maintenance mode.
5. Type: mailbox destroy local
6. Type: mailbox destroy partner
7. Type: halt
8. Reboot the head.
9. Type: cf enable
10. Type: ic stats error -v


Note: A possible stale mailbox instance on the local or remote site results in the following
message on the storage system: [ds-dt01terra: fmmbx_instanceWorke:info]: missing lock
disks, possibly stale mailbox. After reassigning the drives during an upgrade, no mailbox
disks were visible (missing mailbox disks). The local and remote instances of the mailbox disks
need to be re-initialized; perform steps 1 to 10 above on both nodes.
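The ten-step procedure above looks roughly like this at each console (prompts are illustrative;
*> is the Maintenance mode prompt):

    filer1> cf disable
    filer1> reboot
    (press Ctrl-C during boot, then select option 5 for Maintenance mode)
    *> mailbox destroy local
    *> mailbox destroy partner
    *> halt
    (boot normally)
    filer1> cf enable
    filer1> ic stats error -v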
A useful tool to help in the diagnosis of disk pathing issues is Config Advisor (WireGauge
renamed), which is available from the NOW ToolChest.
It can be run remotely by entering an ASUP ID (select "File > Get ASUP"). Comparing results
from both HA partners will often indicate the cause of the mailbox disk path issue.
Cluster is licensed but takeover of partner is disabled due to reason: CFO not licensed
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : CFO not licensed

If the CLUSTER-MONITOR log contains the following message, the cluster license is not
enabled.
===== CLUSTER MONITOR =====
Clustered failover is now unlicensed
cf: option 'monitor' requires that cluster licensing is enabled

Re-enabling the cluster license will clear this error. See Enabling licenses for more details.
A common cause is that the system was once part of a High Availability pair and was improperly
reconfigured as standalone. See Removing an active/active configuration for more information
about how to properly split a High Availability pair.
Cluster is licensed but takeover of partner is disabled due to reason: unsynchronized log
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : unsynchronized log

This is usually associated with problems with the interconnect cabling.
First, verify that the interconnect cables are not cross-connected. On FAS3000 and FAS6000
systems, the two interconnect ports are on the NVRAM card. Verify that Port 0 is connected to
Port 0, and Port 1 to Port 1, on each system in the HA pair.
In some instances, momentarily unplugging and reseating each interconnect cable will clear this
error. Breaking and reestablishing the interconnect link will force the logs to re-synchronize.
Changes in High Availability 'Takeover Impossible' events in Data ONTAP 8.x
1. There are additional EMS messages that describe the reason for takeover impossible. The
messages start with 'ha.takeoverImp':
[ha.takeoverImpIC:warning]: Takeover of the partner node is impossible because of interconnect errors.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible due to reason status of backup mailbox is uncertain.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible due to reason partner booting.
[ha.takeoverImpUnsync:warning]: Takeover of the partner node is impossible due to lack of partner NVRAM data.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible due to reason partner halted in notakeover mode.
2. The hourly takeover disabled message changed in Data ONTAP 8. See the following link:
Syslog Translator

- Controller Failover is licensed but takeover of partner is disabled due to reason : Controller Failover not initialized
- Controller Failover is licensed but takeover of partner is disabled due to reason : Controller Failover not licensed
- Controller Failover is licensed but takeover of partner is disabled due to reason : interconnect error
- Controller Failover is licensed but takeover of partner is disabled due to reason : local halt in progress
- Controller Failover is licensed but takeover of partner is disabled due to reason : NVRAM size mismatch
- Controller Failover is licensed but takeover of partner is disabled due to reason : partner booting
- Controller Failover is licensed but takeover of partner is disabled due to reason : partner halted in notakeover mode
- Controller Failover is licensed but takeover of partner is disabled due to reason : partner mailbox disks not accessible or invalid
- Controller Failover is licensed but takeover of partner is disabled due to reason : status of backup mailbox is uncertain
- Controller Failover is licensed but takeover of partner is disabled due to reason : takeover disabled by partner
- Controller Failover is licensed but takeover of partner is disabled due to reason : unsynchronized log
- Controller Failover is licensed but takeover of partner is disabled due to reason : version mismatch
- Controller Failover is licensed but takeover of partner is disabled due to reason : waiting for partner to recover
- Controller Failover is licensed but takeover of partner is disabled: partner identification not accessible or invalid

3. The name of one of the ASUP logs changed in 8.X to CF-MONITOR.

How to power down and power up the controllers in a 7-Mode HA-Pair
Description

Shutting down an HA Pair of controllers.


This article describes the procedure to follow when powering down and powering up NetApp
storage controllers and shelves.
Procedure

Perform the following steps:


1. Pre-Power Down:
i. Dial 1-888-463-8277 and talk to a Customer Support Representative (CSR). Advise them that
the firm has a scheduled power down, or open a ticket through the NetApp Support site.
ii. The serial numbers of the storage systems will be required. These can be obtained by running
the sysconfig command on the console or by opening OnCommand System Manager and observing the
serial number listed.
iii. Ensure that critical client applications are terminated and users are warned prior to
proceeding with the shutdown of the storage systems.

2. Power Down:
There are two ways to power down the storage systems in a cluster:
i. Disable and re-enable the cluster by running the cf disable and cf enable commands:
- Disable the cluster by running the following command in a console or telnet session on one of
the nodes:
cf disable
- Run the following command in a telnet or console session connected to each node:
cf status
The result should show that the cluster is disabled. Users will not suffer an interruption when
a cf disable or cf enable command is executed.
- If CIFS is in use, check the CIFS sessions and stop CIFS services by entering (per vfiler if
MultiStore is in use):
[vfiler run *] cifs sessions
[vfiler run *] cifs terminate -t 0
(in place of 0, you can specify any number of minutes you wish to wait)
- Check on SnapMirror, SnapVault, and NDMP sessions, since these should not be interrupted by a
shutdown (NDMP sessions can be killed if hung):
snapmirror status -t (the output displays the relationships that are active)
snapvault status -t
[vfiler run *] ndmpd probe
- Use console sessions to halt each node:
halt
- Power down both nodes. After maintenance, power up both nodes. During power up, both nodes
will start serving data.
- Run the following command in a telnet or console session connected to each node:
cf status
The result should show that the cluster is disabled.
- Enable the cluster by running the following command in a telnet or console session on one of
the nodes:
cf enable
- Run the following command in a telnet or console session connected to each node:
cf status
The result should show that the cluster is enabled. The cluster should now be up and running.
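Condensed, the cf disable method is this sequence at the consoles (hostnames and status wording
are illustrative; repeat the cf status checks on both nodes):

    filer1> cf disable
    filer1> cf status
    Cluster disabled.
    filer1> halt
    filer2> halt
    (power down, perform maintenance, power up both nodes)
    filer1> cf status
    filer1> cf enable
    filer1> cf status
    Cluster enabled, filer2 is up.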
ii. Using halt -f:
- Run the following command in a console session connected to each node (note: this will also
disable takeover):
halt -f
- Output similar to the following will be displayed on each node at shutdown:
storage system> halt -f
Wed Jun 11 12:26:45 GMT [storage system: kern.shutdown:notice]: System shut down because : "halt".
Wed Jun 11 12:26:46 GMT [storage system: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of filer disabled (local halt in progress)
DBG: Shut down RAID hilevel took 0 msecs
DBG: Shut down WAFL took 77 msecs (0 0 0 0 0 1 76)
DBG: Shut down RAID took 88 msecs
Program terminated
ok boot
- Power down both nodes. After maintenance, power up both nodes.
- The first storage system to come up may show the following status when you run cf status in
the storage system's console or telnet command line:
storage system> cf status
storage system may be down, takeover disabled because of reason (partner halted in notakeover mode)
storage system has disabled takeover by storage system2 (unsynchronized log)
VIA Interconnect is down (link up).
- When the second storage system comes up, it will automatically try to enable the cluster. When
the storage system is fully up, run cf status at this storage system's console or telnet
session. The result should show that the cluster is now enabled. Run the same command at the
other storage system's telnet or console session to verify that both nodes are reporting a
consistent status. The cluster should now be fully operational.

3. Power Up:
i. Power on the SAN fabric (switches).
ii. In a Fabric MetroCluster configuration, power up the back-end switches, followed by the ATTO
bridges where SATA shelves are in use.
iii. Power on the drive shelves for each storage system.
iv. Power on the heads (controllers), one at a time.
v. Telnet to one of the storage systems and run the command cf enable. This will re-enable
clustering. (Additional information in Part V.)
vi. Power on the hosts.
vii. Verify access to storage-system-based shares, exports, and/or LUNs.

4. Post Power Up:
i. Dial 1-888-463-8277 and talk to a CSR (Customer Support Representative). Advise them that
the firm has completed the scheduled power down, or update the ticket via the NetApp Support
site.
ii. The serial numbers of your storage systems will be required. These can be obtained by typing
sysconfig on the console or by opening FilerView > Status.

Additional Information (based on man pages):
i. halt:
The halt command flushes all cached data to disk, turns off the non-volatile RAM, and drops
into the monitor. Any time you power off the storage system, run the halt command to conserve
the batteries on the non-volatile RAM.
NFS clients can maintain use of a file over a halt or reboot (although experiencing a failure to
respond during that time), but CIFS clients cannot do so safely.
If the storage system is running CIFS, the halt command invokes cifs terminate, which requires
the -t option. If the storage system has CIFS clients and you invoke halt without -t, it
displays the number of CIFS users and the number of open CIFS files, then prompts you for the
number of minutes to delay. cifs terminate automatically notifies all CIFS clients that a CIFS
shutdown is scheduled in X minutes and asks them to close their open files. CIFS files that are
still open at the time the storage system halts will lose any writes that had been cached but not
written.
halt logs a message in /etc/messages to indicate that the storage system was halted on purpose.
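For example, on a system serving CIFS, the delay can be supplied up front with the -t option so
that halt does not have to prompt for it (a sketch based on the man page excerpt above; the
value is in minutes, so verify the syntax on your release):

    storage system> halt -t 5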
ii. cf:
The cf command controls the cluster failover monitor, which determines when the takeover and
giveback operations take place within a cluster. The cf command is available only if your
storage system has the cluster license.
OPTIONS:
disable
    Disables the takeover capability of both storage systems in the cluster.
enable
    Enables the takeover capability of both storage systems in the cluster.
forcegiveback
    Forces the live storage system to give back the resources of the failed storage system even
    though the live storage system detects an error that would prevent a complete giveback. For
    example, an error might prevent the failed storage system from flushing data in the NVRAM to
    disk during a giveback. If the live storage system detects this error, it does not perform a
    giveback. However, using the forcegiveback option forces a giveback despite such an error.
    When the failed storage system reboots as a result of a forced giveback, it displays the
    following message:
    partner giveback incomplete, some data may be lost
forcetakeover
    Forces one storage system to take over its partner even though the storage system detects an
    error that would otherwise prevent a takeover. For example, normally, if a detached or faulty
    ServerNet cable between the storage systems causes the storage systems' NVRAM contents to be
    unsynchronized, takeover is disabled. However, if you run the cf forcetakeover command, the
    storage system takes over its partner despite the unsynchronized NVRAM contents. This command
    might cause the storage system being taken over to lose client data.
giveback [ -f ]
    Initiates a giveback of partner resources. Once the giveback is complete, the automatic
    takeover capability is disabled until the partner is rebooted. A giveback fails if there are
    outstanding CIFS sessions or active system dump processes. If the -f option is used, the cf
    command terminates the outstanding CIFS sessions and dump processes before attempting a
    giveback.
partner
    Displays the host name of the partner. If the name is unknown, the cf command displays
    partner.
status
    Displays the current status of the local storage system and the cluster.
takeover
    Initiates a takeover of the partner.
