High-Availability
In clustered Data ONTAP, the recommended cluster switches are the NetApp CN1610 and the Cisco Nexus 5596.
If your cluster has up to 12 nodes, you can use the NetApp CN1610 switch. If your cluster is larger, use the Cisco Nexus 5596.
HIGH-AVAILABILITY FEATURES
High-availability configurations provide fault tolerance and the ability to perform nondisruptive upgrades
and maintenance.
High-availability configurations provide the following benefits:
Fault tolerance: When one node fails or becomes impaired, a takeover occurs and the partner node
continues to serve the data of the failed node.
Nondisruptive software upgrades: When you halt one node and allow takeover, the partner node
continues to serve data for the halted node, allowing you to upgrade the halted node.
Nondisruptive hardware maintenance: When you halt one node and allow takeover, the partner
node continues to serve data for the halted node, allowing you to replace or repair hardware on the halted
node.
limit. It is acceptable for the takeover node to temporarily serve more than the single-node capacity would
normally allow, as long as it does not own more than the single-node capacity.
Disks and disk-shelf compatibility
Both Fibre Channel (FC) and SATA storage is supported in standard high-availability configurations, as
long as the two storage types are not mixed on the same loop.
If needed, a node can have only FC storage and the partner node can have only SATA storage.
Cluster interconnect adapters and cables must be installed.
Nodes must be attached to the same network and the network interface cards must be configured
correctly.
System features such as CIFS, NFS, or SyncMirror software must be licensed and enabled on both
nodes.
PARTNER COMMUNICATION
To ensure that both nodes in a high-availability controller configuration maintain the correct and current
status of the partner node, heartbeat information and node status are stored on each node in the mailbox
disks. The mailbox disks are a redundant set of disks used in coordinating takeover or giveback
operations. If one node stops functioning, the surviving partner node uses the information on the mailbox
disks to perform takeover processing, which creates a virtual storage system. In the event of an
interconnect failure, the mailbox heartbeat information prevents an unnecessary failover from occurring.
Moreover, if cluster configuration information that is stored on the mailbox disks is out of sync during
boot, the high-availability controller nodes automatically resolve the situation. The FAS system failover
process is extremely robust, preventing split-brain issues from occurring.
Note: The same value is recommended on both the local and partner nodes.
TAKEOVER OPERATION
When a takeover occurs, the functioning partner node takes over the functions and disk drives of the
failed node by creating an emulated storage system that:
Assumes the identity of the failed node
Accesses the failed node's disks and serves its data to clients
The partner node maintains its own identity and its own primary functions, but also handles the added
functionality of the failed node through the emulated node.
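A manual takeover exercises the same mechanism and can be sketched at the console as follows (a minimal illustration; the command names follow 7-Mode usage, and the status line is representative rather than verbatim):

node1> cf takeover
node1> cf status
node1 has taken over node2.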
GIVEBACK OPERATION
After a partner node is repaired and operating normally, you can use the cf giveback command to
return operations to the partner.
When the failed node is functioning again, the following events can occur:
You initiate a cf giveback command that terminates the emulated node on the partner.
The failed node resumes normal operation, serving its own data.
The high-availability configuration resumes normal operation, with each node ready to take over for
its partner if the partner fails.
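The giveback sequence above can be sketched from the surviving node's console (a minimal illustration; the status wording is representative rather than verbatim):

node1> cf giveback
node1> cf status
Controller Failover enabled, node2 is up.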
NEGOTIATED FAILOVER
To enable negotiated failover in the event of a failed network interface, you must explicitly enable the
cf.takeover.on_network_interface_failure option, set the failover policy, and mark
each interface that can trigger a negotiated failover (NFO).
NOTE: The cf.takeover.on_network_interface_failure.policy option must be set
manually on each controller in a high-availability pair:
all_nics: ALL interfaces marked for failover must fail before takeover occurs.
any_nic: ANY interface marked for failover triggers a high-availability takeover.
The cf.takeover.on_network_interface_failure option
is not the first line of defense against a network switch being a single point of failure. This option should
only be considered when a single-mode vif or second-level vif cannot be used. Controller failover is
disruptive to CIFS clients and can be disruptive to NFS clients using soft mounts. Vif failover, by contrast, is
completely nondisruptive and is therefore the preferred method. Nevertheless, negotiated failover is used
increasingly in MultiStore environments.
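Enabling negotiated failover as described above might look like this at the console (a sketch; the option names and the nfo interface flag follow the description above, but verify the exact syntax against your release's documentation):

node1> options cf.takeover.on_network_interface_failure on
node1> options cf.takeover.on_network_interface_failure.policy all_nics
node1> ifconfig e0a nfo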
BEST PRACTICES
General best practices require comprehensive testing of all mission-critical systems before introducing
them into a production environment. High-availability controller testing should include takeover and
giveback (functional testing) as well as performance evaluation. Extensive testing validates planning.
Monitor network connectivity and stability.
Unstable networks not only affect total takeover and giveback times, they adversely affect all devices on
the network in various ways. NetApp storage controllers are typically connected to the network to serve
data, so if the network is unstable, the first symptom is degradation of storage-controller performance and
availability. Client service requests are retransmitted many times before reaching the storage controller,
appearing to the client as slow responses from the storage controller. In a worst-case scenario, an unstable
network can cause communication to time-out, and the storage controller appears to be unavailable.
During takeover and giveback operations in the high-availability controller environment, storage
controllers attempt to connect to numerous types of servers on the network, including Windows domain
controllers, DNS, NIS, LDAP, and application servers. If these systems are unavailable or the network is
unstable, the storage controller continues to retry establishing communications, which delays takeover or
giveback times.
What an HA pair is
An HA pair is two storage systems (nodes) whose controllers are connected to each other either
directly or, in the case of a fabric-attached MetroCluster, through switches and FC-VI interconnect
adapters. In this configuration, one node can take over its partner's storage to provide continued data
service if the partner goes down.
You can configure the HA pair so that each node in the pair shares access to a common set of
storage, subnets, and tape drives, or each node can own its own distinct set of storage.
The controllers are connected to each other through an HA interconnect. This allows one node to
serve data that resides on the disks of its failed partner node. Each node continually monitors its
partner, mirroring the data for each other's nonvolatile memory (NVRAM or NVMEM). The
interconnect is internal and requires no external cabling if both controllers are in the same chassis.
Takeover is the process in which a node takes over the storage of its partner. Giveback is the process
in which that storage is returned to the partner. Both processes can be initiated manually or
configured for automatic initiation.
1. Enter the following command on each of the node consoles to enable HA mode capability:
options cf.mode ha
2. Enter the following command on each of the node consoles to reboot the nodes:
reboot
3. Enter the following command on either of the node consoles to enable controller failover:
cf enable
4. Verify that controller failover is enabled by entering the following command on each node
console:
cf status
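If the steps succeed, the verification in step 4 should report that failover is enabled; for example (illustrative output):

node1> cf status
Controller Failover enabled, node2 is up.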
possible. For MetroCluster configurations, you can initiate a forced takeover in this situation.
One or more network interfaces that are configured to support failover become unavailable.
A node does not receive heartbeat messages from its partner.
This could happen if the partner experienced a hardware or software failure that did not result in
a panic but still prevented it from functioning correctly.
You halt one of the nodes without using the -f parameter.
You reboot one of the nodes without using the -f parameter.
Hardware-assisted takeover is enabled and triggers a takeover when the remote management
device (RLM or Service Processor) detects failure of the partner node.
the CIFS sessions is not possible, and some data loss could occur for CIFS users.
If the node doing the takeover panics
If the node that is performing the takeover panics within 60 seconds of initiating takeover, the
following events occur:
The node that panicked reboots.
After it reboots, the node performs self-recovery operations and is no longer in takeover mode.
Failover is disabled.
The local node returns ownership of the aggregates and volumes to the partner node after any
issues on the partner node are resolved or maintenance is complete. In addition, the local node
returns ownership when the partner node has booted up and giveback is initiated either manually
or automatically.
When the failed node is functioning again, the following events can occur:
You issue a cf giveback command that terminates the emulated node on the partner.
The failed node resumes normal operation, serving its own data.
The HA pair resumes normal operation, with each node ready to take over for its partner if the
partner fails.
If your system supports remote management (through an RLM or Service Processor), make
sure that you configure it properly.
Follow recommended limits for FlexVol volumes, dense volumes, Snapshot copies, and LUNs to
reduce the takeover or giveback time.
When adding traditional or FlexVol volumes to an HA pair, consider testing the takeover and
giveback times to ensure that they fall within your requirements.
For systems using disks, check for failed disks regularly and remove them as soon as possible, as
described in the Data ONTAP Storage Management Guide for 7-Mode. Failed disks can extend
the duration of takeover operations or prevent giveback operations.
Multipath HA is required on all HA pairs except for some FAS22xx system configurations, which
use single-path HA and lack the redundant standby connections.
and system firmware versions. See the Data ONTAP Release Notes for 7-Mode for the list of
supported systems.
Nonvolatile memory (NVRAM or NVMEM) size and version compatibility
The size and version of the system's nonvolatile memory must be identical on both nodes in an
HA pair.
Storage capacity
The number of disks or array LUNs must not exceed the maximum configuration capacity. If
your system uses both native disks and array LUNs, the combined total of disks and array LUNs
cannot exceed the maximum configuration capacity. In addition, the total storage attached to each
node must not exceed the capacity for a single node.
Disks and disk shelf compatibility
FC, SATA, and SAS storage are supported in standard HA pairs.
FC disks cannot be mixed on the same loop as SATA or SAS disks.
Different connection types cannot be combined in the same loop or stack.
Different types of storage can be used on separate stacks or loops on the same node. You can
also dedicate a node to one type of storage and the partner node to a different type.
installed on it. If the takeover node does not have a license that was being used by the partner
node to serve data, your HA pair loses functionality after a takeover.
For an HA pair using array LUNs, both nodes in the pair must be able to detect the same array
LUNs.
However, only the node that is the configured owner of a LUN has read-and-write access to that
LUN. During takeover operations, the emulated storage system maintains read-and-write access
to the LUN.
The following diagram shows the connections between the controllers and the disk shelves for an
example HA pair using multipath HA. The redundant primary connections and the redundant standby
connections are the additional connections required for multipath HA for HA pairs.
Advantages of multipath HA
Multipath connections in an HA pair reduce single points of failure.
By providing two paths from each controller to every disk shelf, multipath HA provides the
following advantages:
The loss of a disk shelf module, connection, or host bus adapter (HBA) does not require a
failover.
The same storage system can continue to access the data using the redundant path.
The loss of a single disk shelf module, connection, or HBA does not prevent a successful
failover.
The takeover node can access its partner's disks using the redundant path.
You can replace modules without having to initiate a failover.
neither node has taken over the other).
The following example shows that the HA pair is enabled and the interconnect is up and working
correctly:
node1> cf status
Controller Failover enabled, node2 is up.
RDMA Interconnect is up (Link 0 up).
If the output shows that one link is down, the HA pair is degraded and you must configure the
link so that it is up while the other link is still active.
Note: Depending on the storage system model, the output might display either RDMA
Interconnect or VIA Interconnect in the last line.
Note: Data ONTAP can disable controller failover if a software or hardware problem exists
that prevents a successful takeover. In this case, the message returned from the cf status
command describes the reason failover is disabled.
partner sends back an acknowledgment, and a message indicating the successful receipt of the test
alert is displayed on the console.
Step
Depending on the message received from the cf hw_assist test command, you might need to
reconfigure options so that the HA pair and the remote management card are operating.
1. Enter the following command to display or clear the hardware-assisted takeover statistics,
respectively:
cf hw_assist stats
cf hw_assist stats clear
You can display the name of the other node with the cf partner command.
Step
You can halt the node and prevent its partner from taking over. For example, you might need to
perform maintenance on both the storage system and its disks and want to avoid an attempt by
the partner node to write to those disks.
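In 7-Mode this is done with the -f flag to the halt command, which halts the node and inhibits takeover by the partner (a minimal sketch):

node1> halt -f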
Step
Step
If this option is enabled and a takeover occurs because of a reboot, then an automatic giveback is
performed after the partner has booted. This giveback occurs even if the
cf.giveback.auto.enable option is set to off. However, if a node takes over its partner due to a
reboot and that node itself reboots before it can execute a giveback, it performs automatic giveback
only if cf.giveback.auto.enable is set to on.
If cf.takeover.on_reboot is off and a node is rebooted, the partner does not take over
immediately. However, the partner could take over later if the node takes more than 180 seconds to boot.
Note: If the reboot -f command is used, then the partner does not take over under any circumstances.
If you enter this command on one node, the value applies to both nodes.
The setting of this option is persistent across reboots.
By default, Data ONTAP will initiate an automatic giveback after a takeover on panic.
The cf.giveback.auto.after.panic.takeover option can be used to disable this
automatic giveback.
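Setting the options discussed above might look like this at the console (a sketch; run on each node as appropriate for your takeover and giveback policy):

node1> options cf.giveback.auto.enable on
node1> options cf.takeover.on_reboot on
node1> options cf.giveback.auto.after.panic.takeover on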
Steps
2. Enable or disable automatic takeover on panic by entering the following command:
options cf.takeover.on_panic {on|off}
on enables immediate takeover of a panicked node. This is the default value.
off disables immediate takeover of a panicked node. If you disable this option, normal takeover
procedures apply: if a node panics and stays down without sending messages to its partner for 15
seconds, the partner then automatically takes over the failed node.
Both partners do not need to have the same value for this option. Thus, you can have one partner that
takes over more quickly than the other.
Note: If your HA pair is failing over because one of the nodes is too busy to respond to its partner,
increase the value of the cf.takeover.detection.seconds option on the partner.
Step
might not be generated for some system panics. Use caution when assigning a takeover time of
less than 15 seconds.
You can specify the nfooption for an interface group. However, you cannot specify the nfooption
for any underlying physical interface of the interface group.
Steps
Giveback operations
Giveback can be implemented and configured in a number of different ways. It can also be
configured to occur automatically.
1. Remove the failed disks, as described in the Data ONTAP Storage Management Guide for 7-Mode.
After you finish
When all failed disks are removed or replaced, proceed with the giveback operation.
Initiating normal giveback
You can return control to a taken-over partner with the cf giveback command.
Forcing giveback
Because the takeover node might detect an error condition on the failed node that typically prevents a
complete giveback (such as data not being flushed from NVRAM to the failed node's disks), you can
force a giveback, if necessary.
About this task
You can use this procedure to force the takeover node to give back the resources of the failed node
even if the takeover node detects an error that typically prevents a complete giveback.
Note: The cf forcegiveback command should be used with caution because it can cause a loss
of data. If you cannot risk loss of data and are unable to complete the giveback, contact technical
support.
Steps
succeed.
When you use this command, you risk losing any data committed to NVRAM but not to disk.
If a cifs terminate command is running, allow it to finish before forcing a giveback.
If giveback is interrupted
If the takeover node experiences a failure or a power outage during the giveback process, that
process stops and the takeover node returns to takeover mode until the failure is repaired or the
power is restored.
However, this depends upon the stage of giveback in which the failure occurred. If the node
encountered failure or a power outage during partial giveback state (after it has given back the
root aggregate), it will not return to takeover mode. Instead, the node returns to partial-giveback
mode. If this occurs, complete the process by repeating the giveback operation.
Configuring giveback
You can configure how giveback occurs, setting different Data ONTAP options to improve the speed
and timing of giveback.
During the delay, the system periodically sends notices to the affected clients. If you specify 0,
CIFS clients are terminated immediately.
This option is used only if automatic giveback is enabled.
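The CIFS termination delay described above is set through an option; assuming the option name is cf.giveback.auto.cifs.terminate.minutes (verify against your release's documentation), a five-minute warning period might look like this:

node1> options cf.giveback.auto.cifs.terminate.minutes 5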
Step
HIGH-AVAILABILITY CONFIGURATION
In the past, NetApp used the term active-active to describe the high-availability (HA) controller
failover configuration. Two controller heads are configured as a pair, with each node providing failover
support for its partner.
CONFIGURATION
High-availability is a licensed feature that must be enabled before it can be configured and used. The
high-availability feature must be licensed on both HA nodes.
License the first node.
license add <licnum>
License the second node.
license add <licnum>
NOTE: You must then reboot and start the controller failover feature.
After the high-availability feature is enabled, you can unlicense it only when the HA pair is in a normal
state and the controller failover services are manually disabled.
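Putting the licensing and enablement steps together, the sequence might look like this at the console (a sketch; the license code shown is a hypothetical placeholder):

node1> license add ABCDEFG
node1> reboot
node1> cf enable
node1> cf status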
Before you enable the high-availability feature, you must configure various settings. For example, the HA
pair interconnect cable must be attached, both controllers must be provided Fibre Channel connections to all
expansion drawers, and both controllers must be provided access to the same IP subnets.
The high-availability feature activates numerous high availability capabilities, such as NVRAM
mirroring, which enables the controllers to provide failover support for each other.
For example, if a controller failure occurs:
The surviving node spawns a virtual instance of the failed node
The virtual node accesses its mirrored NVRAM to complete any interrupted write
The local network interface assumes the IP address of both the local and partner interfaces (for
Ethernet traffic)
The local FC interfaces retain their original WWPN addresses, and the host-based MPIO drivers direct
all FC traffic via the interfaces (assuming that single-image cfmode is being used)
The process of removing a high-availability configuration is as follows:
1. Disable the controller failover feature (cf disable).
2. Delete the controller failover license (license delete).
3. Remove the partner's network entries from the /etc/rc file.
4. Halt, and make sure the partner-sysid is blank.
5. Power down and remove or relocate the controller failover interconnect card.
6. Repeat the process on the other controller.
ADMINISTRATION
In a high-availability configuration, all disks are visible to both controllers. Before a disk can be used in
an aggregate or a spare, it must be assigned to one or the other controller. This process is known as
software disk assignment. If a disk is not assigned to a controller (in other words, if it is listed as not
owned), then it cannot be used by either controller for any purpose.
Use the following commands to manage disk ownership:
Assign disk ownership:
disk assign
List disk ownership (several methods):
disk show -v
storage show disk
sysconfig -r
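For example, assigning a specific unowned disk to the local controller and verifying the result might look like this (a sketch; the disk ID 0a.16 is a hypothetical example):

node1> disk assign 0a.16
node1> disk show -v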
NOTE: Earlier versions of Data ONTAP supported hardware disk assignment, where ownership was
determined by the Fibre Channel cabling topology. This mode is not supported on any current-generation
controller.
Generally, administration of a high-availability configuration and administration of two non-clustered
controllers are identical. The clustered controllers are managed separately, although some configuration
settings must be synchronized between the two controllers. One of the features that you must master is the
process of HA pair failover and failback.
For example, after a failed controller is rebooted and ready to assume its old identity and workload, it
displays a "waiting for giveback" or "waiting for mb giveback" message. At this point, the administrator
enters the cf giveback command on the operational controller to return the failed controller to
the normal state.
NOTE: For more information about controller failover management, refer to the product manuals.
PERFORMANCE
Optimum performance is usually achieved when the controllers in a high-availability configuration share
the client workload evenly. An even distribution of the workload is usually attributable to good solution
planning and to automatic load balancing in the host-based MPIO drivers (for FC traffic).
In an FC SAN environment, ensure that the host-based multipathing support is correctly configured.
Where appropriate, use Asymmetric Logical Unit Access (ALUA) support.
In most other ways, the performance concerns of a high-availability configuration and of a non-clustered
configuration are identical.
SECURITY
In almost all aspects of security, a high-availability configuration and a non-clustered configuration are
identical.
TROUBLESHOOTING
In a high-availability configuration, both controllers require connectivity to all of the disk expansion
shelves. It is not possible to have a shelf connected to one controller and not to the other controller. If a
controller loses access to one of the disk shelves, a negotiated (clean, but automatic) failover is triggered.
It is recommended that multipath HA cabling be used for the disk-expansion shelf connections. The
cabling prevents unnecessary controller failover for non-critical reasons, such as SFP failure, loop breaks,
ESH module failure, and so on.
Parameters
Network interfaces
Configuration files
Checking the hardware-assisted takeover status of the local and partner node
You can check the status of the hardware-assisted takeover configuration with the
cf hw_assist status command. It shows the current status for the local and partner nodes.
Example of hardware-assisted takeover status
The following example shows output from the cf hw_assist status command:
Local Node Status - ha1
Active: Monitoring alerts from partner (ha2)
port 4004 IP address 172.27.1.14
Partner Node Status - ha2
Active: Monitoring alerts from partner (ha1)
port 4005 IP address 172.27.1.15
HA pair, the local chassis. To ensure consistent HA state information throughout the system,
you must also run these commands on the partner controller module and chassis, if necessary.
The HA state is recorded in the hardware PROM in the chassis and in the controller module. It must
be consistent across all components of the system,
Steps
1. Reboot the current controller module and press Ctrl-C when prompted to display the boot
menu.
2. At the boot menu, select the option for Maintenance mode boot.
3. After the system boots into Maintenance mode, enter the following command to display the
HA state of the local controller module and chassis:
ha-config show
5. If necessary, enter the following command to set the HA state of the chassis:
ha-config modify chassis <ha-state>
7. Boot the system by entering the following command at the boot loader prompt:
boot_ontap
1) Is it possible to do a safe shutdown of a NetApp FAS2240 that includes two controllers and HA?
We have a NetApp FAS2240 with two controllers and HA. Next week we must
do a complete shutdown. How can we perform this safely?
A) Answer
In what order to power elements up and down during a power cycle procedure?
Data ONTAP 7-Mode procedure to bring system down:
If the storage system is clustered, enter cf disable. Alternatively, follow the section
'For a clustered storage system...' given below Step 4.
Notify, disconnect and, if needed, shut down all of the connected CIFS/NFS clients.
If there are any hosts that have FCP or iSCSI-based LUNs, shut them down before
shutting down the storage system.
Terminate CIFS with the cifs terminate command.
Run the halt command at the storage system command-line interface. Allow the
storage system to terminate Data ONTAP and return to the 'ok', 'CFE', or 'LOADER'
prompt.
For a clustered storage system, if you did not run cf disable: run the halt -f
-t command on each of the partners.
Physically power down the head, then all the attached disk shelves as needed.
Physically unplug the cables from power supplies on the back of the storage
systems and shelves, to avoid any electrical issues when external power is restored.
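The 7-Mode shutdown steps above reduce to a short command sequence per node (a sketch; client notification and host shutdown are still done first, and physical power-down follows):

node1> cf disable
node1> cifs terminate
node1> halt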
Clustered Data ONTAP procedure to bring system down:
Notify, disconnect and, if needed, shut down all of the connected CIFS/NFS clients.
If there are any hosts that have FCP or iSCSI-based LUNs, shut them down before
shutting down the storage system.
If running ONTAP version prior to 8.2, perform the following steps:
If on a 2-node cluster, run the following:
::> cluster ha modify -configured false
::> storage failover modify -node * -enabled false
Log in to all nodes, one at a time (preferably using serial console or RLM/SP) and
run:
::> system node halt -node local -inhibit-takeover true
The following will appear after running the halt command above. Type 'y' when
prompted if you want to continue:
(system node halt)
Every node might take several minutes to shut down. Each node should then reset
and return to the LOADER> prompt. If there is no console or RLM/SP access, you
should confirm the overall node down status before halting the final node, by
running the system node show command. After the last node is halted, you can
power down everything safely.
Physically power down the head, then all the attached disk shelves as needed.
Physically unplug the cables from power supplies on the back of the storage
systems and shelves, to avoid any electrical issues when external power is restored.
Procedure to bring the system back online:
Reconnect all the power cables if previously disconnected.
Power on core switches.
Physically power up all disk shelves first. Wait until 30 seconds after the last disk
shelf is powered on, then power on the storage system head so that all disks will be
available when they are required by Data ONTAP.
Verify the storage system is up, all services are running, and network connectivity is
present.
For 7-Mode HA pairs, if cluster was disabled using cf disable, enter cf enable and
monitor with cf status.
For clustered ONTAP systems, check cluster show and storage failover show to
confirm CFO/SFO is configured/enabled.
- If on a version prior to 8.2 in which cluster ha and/or storage failover were disabled,
run the following commands:
::> cluster ha modify -configured true
::> storage failover modify -node * -enabled true
NFS
I have a question regarding volume and export paths with controllers in an HA setup.
For general purposes, this is a pair of FAS2240 controllers with OnTAP 8.1RC3
Lets suppose we have two volumes on each controller:
Controller A:
/vol/vol0 (root volume)
/vol/data0 (some generic data, NFS export)
Controller B:
A) During takeover, a virtual copy of the partner filer is started with all of its resources.
If a disk shelf or loop contains a mix of disks owned by Node A and Node B, you must use this
procedure to move the data and make disk ownership uniform within the disk shelf or loop.
Before you begin
1. Use the following command to identify any disk shelves or loops that contain both disks
belonging to Node A and disks belonging to Node B:
disk show -v
2. Determine which node the disk shelf or loop with mixed ownership will be attached to when
the HA feature is unconfigured and record this information.
For example, if the majority of the disks in the loop belong to Node A, you probably want the
entire loop to belong to stand-alone Node A.
After you finish
You must determine the current configuration of the storage controller because the controller
failover and HA mode states can vary. You can use the cf status command to determine the
current configuration.
You must also confirm that all loops and disk shelves in the system contain disks that belong to
only one of the two nodes you intend to isolate. If any disk shelves or loops contain a mix of
disks belonging to both nodes, you must move data.
About this task
When a storage controller is shipped from the factory or when Data ONTAP is reinstalled using
option four of the Data ONTAP boot menu (Clean configuration and initialize all
disks), HA mode is enabled by default, and the system's nonvolatile memory (NVRAM or
NVMEM) is split. If you plan to use the controller in standalone mode, you must configure the
node as non-HA. Reconfiguring the node as non-HA mode enables full use of the system's
nonvolatile memory.
Note: Configuring the node as standalone removes the availability benefits of the HA pair.
Example
node> cf status
Non-HA mode. Reboot to use full NVRAM.
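Reconfiguring a node from HA to stand-alone mode might look like this at the console (a sketch assuming the cf.mode option accepts the value non_ha; verify against your release's documentation):

node> cf disable
node> options cf.mode non_ha
node> reboot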
nonvolatile memory:
Example
node> cf status
Controller Failover enabled
Original entry:
ifconfig e0 199.9.204.254 partner 199.9.204.255
Edited entry:
ifconfig e0 199.9.204.254
Note: Most HA pair interfaces are configured as shared interfaces because they do not require an
extra NIC.
Both nodes in the HA pair must have interfaces that access the same collection of networks and
subnetworks.
You must gather the following information before configuring the interfaces:
The IP address for both the local node and partner node.
The netmask for both the local node and partner node.
The MTU size for both the local node and partner node.
The MTU size must be the same on both the local and partner interface.
Note: You should always use multiple NICs with interface groups to improve networking
availability for both stand-alone storage systems and systems in an HA pair.
Network configuration changes made by using the ifconfig command are not automatically
included in the /etc/rc file. To make the configuration changes persistent after reboots,
include the ifconfig command in the /etc/rc file.
When you configure an IP address, your storage system creates a network mask based on the
To configure a quad-port Ethernet interface e3a to use the IPv4 address 192.0.2.10, enter the
following command:
ifconfig e3a 192.0.2.10
To configure a quad-port Ethernet interface e3a to use the IPv6 address
2001:0db8:35ab:0:8a2e:0:0370:85, enter the following command:
ifconfig e3a 2001:0db8:35ab:0:8a2e:0:0370:85
When specifying the partner IP address, both the local network interface and the partner's
network interface must be attached to the same network segment or network switch.
About this task
If the network interface is an interface group, the partner interface must be denoted by an
interface name and not an IP address.
The partner interface can be an interface group or a physical network interface.
You cannot specify the underlying physical ports of an interface group in a partner
configuration.
If IPv6 addresses are to be taken over, you must specify the partner interface, and not an IP
address.
Address to address mapping is not supported for IPv6 addresses.
For the partner configuration to be persistent across reboots, you must include the ifconfig
command in the /etc/rc file.
For a successful takeover in both directions, you must repeat the partner configuration in
the /etc/rc files of each node.
When specifying the partner interface name, you can configure the interfaces symmetrically, for
example map interface e1 on one node to interface e1 on the partner node.
Though symmetrical configuration is not mandatory, it simplifies administration and
troubleshooting tasks.
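As a sketch, a symmetric partner configuration might appear as follows in each node's /etc/rc file. The interface name and addresses here are hypothetical, not taken from the examples above:

```
# /etc/rc on nodeA (hypothetical addresses)
ifconfig e1 192.0.2.10 partner e1

# /etc/rc on nodeB
ifconfig e1 192.0.2.20 partner e1
```

Because the partner is specified by interface name (required for interface groups, and valid for physical interfaces), each node knows which local interface should assume the partner's addresses during takeover.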
option. When taking over its partner, the node uses the partner's /etc/mcrc file to configure
partner addresses locally. These addresses will reside on the local subnetwork.
Config Advisor is a configuration validation and health check tool for NetApp systems. It can be
deployed at both secure sites and non-secure sites for data collection and system analysis.
Note: Support for Config Advisor is limited, and available only online.
Steps
1. Log in to the NetApp Support Site at support.netapp.com and go to Downloads > Utility
ToolChest.
2. Click Config Advisor (WireGauge renamed).
3. Follow the directions on the web page for downloading, installing, and running the utility.
4. After running Config Advisor, review the tool's output and follow the recommendations to
address any issues discovered.
1. Check the cabling on the HA interconnect cables to make sure that they are secure.
2. Verify that you can create and retrieve files on both nodes for each licensed protocol.
3. Enter the following command from the local node console:
cf takeover
The local node takes over the partner node.
4. Enter the following command from the local node console:
cf giveback
The local node releases the partner node, which reboots and resumes normal operation. The
following message is displayed on the console when the process is complete:
giveback completed
Data ONTAP will not attempt a partner takeover when it can determine prior to the takeover
attempt that the takeover will fail.
The resulting ASUP will automatically open a customer support case with "TAKEOVER
IMPOSSIBLE" in the symptom field. The case symptom text will be of the form:
CLTFLT: Cluster Notification from (PARTNER DOWN, TAKEOVER IMPOSSIBLE) ERROR
This article describes how to diagnose five common causes of takeover impossible events, and
the actions required to correct the issues found. The focus is on remote diagnosis from ASUP
logs, primarily the MESSAGES and CLUSTER-MONITOR logs.
Approximately 70% of NetApp FAS3000, FAS3100 and FAS6000 systems are deployed as High
Availability (HA) configurations. Proper configuration of HA systems requires installing all
necessary HA hardware, enabling cluster software licenses, setting HA related options, and more.
Hourly alert messages will be posted to the console in many instances if the HA system is not
configured properly and takeover by the partner system is not possible. The messages will be of
the form: "statd:ALERT Cluster is licensed but takeover of partner is disabled." This article
describes several common messages and actions required to correct the configuration issues.
The focus is on remote diagnosis from ASUP logs, primarily the MESSAGES and CLUSTER-MONITOR logs.
Five common types of statd:ALERT messages are described below:
Cluster is licensed but takeover of partner is disabled
The most common reason that systems report this message is that takeover has been disabled
manually by an operator entering cf disable from the console command
line. Entering cf enable will re-enable takeover and clear the hourly ALERT message.
To confirm that takeover has been disabled by the operator, check the ASUP CLUSTER-MONITOR
log. The fifth entry in the log begins with "takeoverByPartner". If takeover has
been manually disabled, the entry will contain the text string:
"NVRAM_DOWN,CLUSTER_DISABLE"
Example:
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:02):
partner 'NetApp1' VIA Interconnect is up (link 0 up, link 1 up)
state UP, time 90788045660, event CHECK_FSM, elem ChkMbValid (12)
mirrorConsistencyRequired TRUE
takeoverByPartner 0x2041
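This check can be mechanized when triaging many ASUP bundles. The following is a small Python sketch (not a NetApp tool) that scans a CLUSTER-MONITOR excerpt for the takeoverByPartner entry and flags a manual disable; it assumes the text-string form of the entry shown above, and does not decode the hex-bitmask form (such as 0x2041) that some Data ONTAP versions log instead:

```python
def takeover_manually_disabled(cluster_monitor_text):
    """Return True if the takeoverByPartner entry indicates cf disable.

    Looks for the text-string form of the entry, e.g.
    'takeoverByPartner NVRAM_DOWN,CLUSTER_DISABLE'. Versions that log a
    hex bitmask (e.g. 'takeoverByPartner 0x2041') are not decoded here.
    """
    for line in cluster_monitor_text.splitlines():
        line = line.strip()
        if line.startswith("takeoverByPartner"):
            return "CLUSTER_DISABLE" in line
    return False
```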
Cluster is licensed but takeover of partner is disabled due to reason: interconnect error
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : interconnect error
The interconnect link status is shown as the second line in the CLUSTER-MONITOR log. In the
examples below, the interconnect is not present or both links are down.
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:01):
partner 'NetApp1', Interconnect not present <<< look here
Another common abnormal condition shows the "partner" as "unknown".
===== CLUSTER MONITOR =====
cf: Current monitor status (28Jun2009 00:00:02):
partner 'unknown', VIA Interconnect is down (link 0 down, link 1 down) <<< look here
The corrective action required is to verify that the interconnect cables and links are connected
and active. When the partner is reported as 'unknown', verify that the partner filer/platform is
present and active. If no partner system is present, the system was likely once part of an HA
pair and was improperly reconfigured as standalone. See the documentation (Removing an
active/active configuration) for more information about how to properly split a cluster and
clear the 'unknown' partner messages.
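Because the partner/interconnect status line has a fixed shape, it can be extracted mechanically from a CLUSTER-MONITOR excerpt. The following Python sketch (not a NetApp tool; the regular expression is an assumption based on the status lines shown above) pulls out the partner name and the interconnect state:

```python
import re

# Matches the partner/interconnect status line from a CLUSTER-MONITOR log,
# e.g. "partner 'NetApp1', Interconnect not present"
# or   "partner 'unknown', VIA Interconnect is down (link 0 down, link 1 down)"
STATUS_RE = re.compile(r"partner '([^']+)',? (?:VIA )?Interconnect (.+)")

def parse_partner_status(line):
    """Return (partner_name, interconnect_status) or None if no match."""
    m = STATUS_RE.search(line)
    if not m:
        return None
    return m.group(1), m.group(2)
```

A partner name of 'unknown' or a status other than "is up" in the returned tuple points to the interconnect and standalone-reconfiguration checks described above.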
Cluster is licensed but takeover of partner is disabled due to reason: partner mailbox disks
not accessible or invalid
The ASUP MESSAGES log will have hourly entries of the form:
[ statd:ALERT]: Cluster is licensed but takeover of partner is disabled due to
reason : partner mailbox disks not accessible or invalid
The status of the mailbox disks is shown approximately 15 lines from the top of the CLUSTER-MONITOR
log. A normal entry shows the disk paths for all of the mailbox disks. An
example is provided below for illustration. The disk identifiers (4a.17, 4a.29, 8b.34, and 8b.35
in the example) will vary depending on the system configuration.
mailbox disks:
Disk 4a.17 is a primary mailbox disk
Disk 4a.29 is a primary mailbox disk
Disk 8b.34 is a partner mailbox disk
Disk 8b.35 is a partner mailbox disk
Disk 8a.20 is a local mailbox disk
Disk 8a.19 is a local mailbox disk
No partner disks attached! <<< look here
To correct these fault conditions, first check that the partner system is present and active. Then
check the FC adapters in the filers/platforms and shelf cabling to each of the mailbox disk
shelves.
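The mailbox-disk lines are regular enough to summarize automatically across many ASUP bundles. The following Python sketch (not a NetApp tool; the line formats are taken from the examples above) counts primary/local and partner mailbox disks and flags the "No partner disks attached!" fault:

```python
def mailbox_health(cluster_monitor_text):
    """Summarize mailbox-disk lines from a CLUSTER-MONITOR excerpt.

    Counts lines such as 'Disk 8b.34 is a partner mailbox disk' and
    flags the 'No partner disks attached!' fault condition.
    """
    partner = local = 0
    no_partner_disks = False
    for line in cluster_monitor_text.splitlines():
        line = line.strip()
        if "No partner disks attached!" in line:
            no_partner_disks = True
        elif line.startswith("Disk ") and "mailbox disk" in line:
            if " partner mailbox disk" in line:
                partner += 1
            elif " local mailbox disk" in line or " primary mailbox disk" in line:
                local += 1
    return {"local": local, "partner": partner, "no_partner_disks": no_partner_disks}
```

A result with zero partner mailbox disks, or with the fault flag set, points to the partner-presence and shelf-cabling checks described above.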
If the problem continues, check whether the 'partner-sysid' environment variable shows the
correct partner system ID:
CFE> printenv
Variable Name          Value
--------------------   --------------------
BOOT_CONSOLE           rlm0a
fcal-host-id           7
partner-sysid          0101183784
Then attempt the following steps on both HA controllers:
1. Disable clustering by typing cf disable.
2. Reboot.
3. Press Ctrl-C during the boot sequence to go to the special boot menu.
4. Select option 5 to go into Maintenance mode.
5. Type: mailbox destroy local
6. Type: mailbox destroy partner
7. Type: halt
8. Reboot the head.
9. Type: cf enable
10. Type: ic stats error -v
Note: A possible stale mailbox instance on the local or remote site results in the following
message on the storage system: [ds-dt01terra: fmmbx_instanceWorke:info]: missing lock disks,
possibly stale mailbox. After reassigning the drives during an upgrade, no mailbox disks
were visible (missing mailbox disks). The local and remote instances of the mailbox disks need
to be re-initialized. Perform steps 1 through 10 above on both nodes.
A useful tool to help in the diagnosis of disk pathing issues is Config Advisor (WireGauge
renamed), which is available from the NOW ToolChest.
WireGauge can be run remotely by entering an ASUP ID. (Enter an ASUP ID by selecting "File
> Get ASUP"). Comparing WireGauge results from both HA partners will often indicate the
cause of the mailbox disk path issue.
Cluster is licensed but takeover of partner is disabled due to reason: CFO not licensed
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : CFO not licensed
If the CLUSTER-MONITOR log contains the following message, the cluster license is not
enabled.
===== CLUSTER MONITOR =====
Clustered failover is now unlicensed
cf: option 'monitor' requires that cluster licensing is enabled
Re-enabling the cluster license will clear this error. See Enabling licenses for more details.
A common cause is the system was once part of a High Availability pair, and was improperly
reconfigured as standalone. See Removing an active/active configuration for more information
about how to properly split a High Availability pair.
Cluster is licensed but takeover of partner is disabled due to reason: unsynchronized log
The ASUP MESSAGES log will have hourly entries of the form:
[: statd:ALERT]: Cluster is licensed but takeover of partner is disabled due
to reason : unsynchronized log
Verify the interconnect cabling; on many systems, the two interconnect ports are on the NVRAM
card. Verify that Port 0 is connected to Port 0, and Port 1 to Port 1, on each system in the HA pair.
In some instances, momentarily unplugging and reseating each interconnect cable will clear this
error. Breaking and reestablishing the interconnect link will force the logs to re-synchronize.
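The hourly ALERT entries for all of these conditions share a fixed shape, so the reason field can be extracted mechanically when triaging a MESSAGES log. The following Python sketch (not a NetApp tool; the regular expression is an assumption based on the message format quoted in the sections above) collects the reasons found:

```python
import re

# Matches hourly entries such as:
# [: statd:ALERT]: Cluster is licensed but takeover of partner is disabled
# due to reason : interconnect error
ALERT_RE = re.compile(r"statd:ALERT.*disabled due to reason\s*:\s*(.+)")

def alert_reasons(messages_log):
    """Return the list of takeover-disabled reasons found in a MESSAGES log."""
    reasons = []
    for line in messages_log.splitlines():
        m = ALERT_RE.search(line)
        if m:
            reasons.append(m.group(1).strip())
    return reasons
```

Feeding each ASUP MESSAGES log through such a filter quickly shows which of the five conditions (operator disable, interconnect error, mailbox disks, CFO not licensed, unsynchronized log) a system is reporting.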
Changes in High Availability 'Takeover Impossible' events in Data ONTAP 8.x
1. There are additional EMS messages that describe the reason for a takeover
impossible event. The messages start with 'ha.takeoverImp'.
[ha.takeoverImpIC:warning]: Takeover of the partner node is impossible
because of interconnect errors.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible
due to reason status of backup mailbox is uncertain.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible
due to reason partner booting.
[ha.takeoverImpUnsync:warning]: Takeover of the partner node is impossible
due to lack of partner NVRAM data.
[ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible
due to reason partner halted in notakeover mode.
2. The hourly takeover disabled message changed in Data ONTAP 8. See the
following link:
Syslog Translator
How to power down and power up the controllers in a 7-Mode HA-Pair
Description
ii.
The serial numbers of the storage systems will be required. These can be
obtained by running the sysconfig command on the console, or by
opening OnCommand System Manager and observing the serial
number listed.
iii.
Ensure that critical client applications are terminated and users are
warned prior to proceeding with the shutdown of the storage systems.
2. Power Down:
There are two ways to power down the storage systems in a cluster:
i.
Disable clustering by running the cf disable command on one of the
storage systems. The result should show that the cluster is disabled.
Users will not suffer an interruption when a cf disable or cf enable
command is executed.
If CIFS is in use, check the CIFS sessions and stop CIFS services
by entering (per vfiler if MultiStore is in use):
[vfiler run *] cifs sessions
[vfiler run *] cifs terminate -t 0
OR
[vfiler run *] cifs terminate -t <n>
where <n> is any number of minutes you wish to wait.
The results should show that the cluster is enabled. The cluster
should now be up and running.
ii.
3. Power Up:
i.
ii.
iii.
iv.
v.
Telnet to one of the storage systems, and run the command cf enable.
This will re-enable clustering (additional information in Part V).
vi.
vii.
ii.
The serial number of your storage systems will be required. This can be
obtained by typing sysconfig on the console or opening FilerView >
Status.
halt:
The halt command flushes all cached data to disk, turns off the non-volatile RAM, and
drops into the monitor. Any time you power off the storage system, run the
halt command to conserve the batteries on the non-volatile RAM.
NFS clients can continue using a file across a halt or reboot (although they
will experience a failure to respond during that time), but CIFS clients cannot
do so safely.
If the storage system is running CIFS, the halt command invokes cifs
terminate, which requires the -t option. If the storage system has CIFS clients
and you invoke halt without -t, it displays the number of CIFS users and the
number of open CIFS files. Then it prompts you for the number of minutes to
delay. cifs terminate automatically notifies all CIFS clients that a CIFS
shutdown is scheduled in X minutes, and asks them to close their open files.
CIFS files that are still open at the time the storage system halts will lose any
writes that had been cached but not written.
halt logs a message in /etc/messages to indicate that the storage system was
halted on purpose.
ii.
cf:
The cf command controls the cluster failover monitor, which determines when
the takeover and giveback operations take place within a cluster. The cf
command is available only if your storage system has the cluster license.
OPTIONS:
forcegiveback
Forces the live storage system to give back the resources of the
failed storage system even though the live storage system detects an error
that would prevent a complete giveback. For example, an error might prevent
the failed storage system from flushing data in the NVRAM to disk during a
giveback. If the live storage system detects this error, it does not perform a
giveback. However, using the forcegiveback option forces a giveback despite
such an error. When the failed storage system reboots as a result of a forced
giveback, it displays the following message:
partner giveback incomplete, some data may be lost
forcetakeover
Forces one storage system to take over its partner even though the storage
system detects an error that would otherwise prevent a takeover. For
example, normally, if a detached or faulty ServerNet cable between
the storage systems causes the storage system's NVRAM contents to be
unsynchronized, takeover is disabled. However, if you run the cf
forcetakeover command, the storage system takes over its partner despite
the unsynchronized NVRAM contents. This command might cause the storage
system being taken over to lose client data.
giveback [ -f ]
Initiates a giveback of partner resources. Once the giveback
partner
Displays the host name of the partner. If the name is unknown, the cf
command displays partner.
status
Displays the current status of the local storage system and the cluster.
takeover