
Serviceguard Network Manager

White paper

Introduction
Serviceguard Network Manager default network failure detection mechanism
    Driver notification
    Traffic statistics and polling mechanism
Inbound failure detection enhancement
    Why the enhancement was made
    Configuration of network failure detection
    How inbound failure detection works
        Broken cascaded cable scenario and how it would be handled
        Algorithm for inbound failure detection method
        Risks
        Examples of network configurations with risks
        Guidelines and recommendation
For more information

Introduction
This white paper briefly describes the default behavior of the Serviceguard network failure detection mechanism. It also describes in detail the enhancement to the network failure detection mechanism that deals with inbound-only failures, and explains why, how, and when the enhanced method should be applied.

Serviceguard Network Manager default network failure detection mechanism


The network failure detection scheme in Serviceguard's Network Manager involves network driver error handling and a polling mechanism. During its periodic check of a cluster's network interfaces, Network Manager looks at two things to detect network interface failures: network driver error notifications and traffic statistics.

Driver notification
Whenever a driver sends any error notification to indicate that a card has failed, Serviceguard immediately declares that the card is bad and performs a failover, if applicable. For example, when a card fails to send due to a severed link, Serviceguard will immediately declare the card down on receiving the error notification.

Traffic statistics and polling mechanism


Serviceguard's Network Manager keeps track of increments in the number of messages coming in and out of the network interfaces used in a cluster. In addition to other network traffic, Serviceguard generates its own polling messages every NETWORK_POLLING_INTERVAL. These messages go between the poller, which is usually the standby, and the other cards on the same bridged net on the node. This polling traffic ensures that Network Manager can rely on a steady stream of traffic to determine whether there is any problem with the LAN interfaces. In Figure 1, polling happens between lan1 and lan0 on each node.

Figure 1. Typical network configuration for a 2-node cluster

The default method has been the only method of network failure detection in Serviceguard prior to version A.11.16. With this default method, Serviceguard will mark the card as failed if both inbound and outbound statistics of a LAN card stop incrementing for a specific amount of time. With this default method, Serviceguard will not mark the card as bad if only inbound statistics stop incrementing or if only outbound statistics stop incrementing. However, if a LAN card fails to send and its outbound statistics stop increasing, Serviceguard will usually receive an error notification from the driver and mark the card as down right away.
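The default rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Serviceguard's implementation: the `CardStats` structure, the `inout_card_failed` function, and the `stall_polls` threshold (standing in for the "specific amount of time") are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CardStats:
    """Cumulative packet counters for one LAN card (hypothetical structure)."""
    inbound: int
    outbound: int


def inout_card_failed(samples: list, stall_polls: int) -> bool:
    """Return True if both counters were flat over the last `stall_polls` intervals.

    `samples` holds one CardStats reading per polling interval, oldest first.
    `stall_polls` is an assumed threshold; the real timing is not given here.
    """
    if len(samples) < stall_polls + 1:
        return False  # not enough history to judge
    window = samples[-(stall_polls + 1):]
    # Counters only ever grow, so equal first and last readings mean no traffic.
    inbound_stalled = window[0].inbound == window[-1].inbound
    outbound_stalled = window[0].outbound == window[-1].outbound
    # Default (INOUT) rule: only a stall in BOTH directions marks the card failed.
    return inbound_stalled and outbound_stalled
```

With this rule, a card whose inbound counter is flat while its outbound counter keeps growing is not reported as failed, which is exactly the gap the enhancement in the next section addresses.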

Inbound failure detection enhancement


An enhanced method of failure detection is introduced in Serviceguard A.11.16 so that inbound-only failures are handled.

Why the enhancement was made


Users can experience application hang when a network card does not receive but still sends. Some users expect that this situation would be solved by a local LAN failover. However, as described above, the design of Serviceguard Network Manager prior to version A.11.16 does not include this feature. A LAN card would not be marked as bad if only its inbound message count stops increasing because there can be situations where there is no deterministic way to tell exactly why the LAN card stopped receiving. It could be that the cascaded cable is broken (see Figure 3), it could be that there is a problem with the logic inside the switch, or it could be a problem with the card itself. Nonetheless, due to customer requests, the Serviceguard team decided to enhance the failure detection mechanism so that the network card would be declared down in certain situations when only inbound traffic stops. However, customers must remember that there are cases when this new enhancement should not be applied.

Configuration of network failure detection


The network failure detection method is configurable, so users can choose between the existing default method and the new enhanced method. Users can also apply the configuration online, without having to bring the cluster down. They do this by modifying the cluster configuration file and setting the value of the NETWORK_FAILURE_DETECTION parameter. NETWORK_FAILURE_DETECTION is a global parameter that affects the behavior of all LAN cards configured in the cluster, regardless of whether they are primary, standby, heartbeat, or data LANs. If the value is INOUT, the existing default method is applied. If it is INONLY_OR_INOUT, the enhanced method is applied.

With INONLY_OR_INOUT, Network Manager checks the inbound and outbound traffic as before. If both inbound and outbound traffic stop incrementing, it marks the card as down, just as it does with the default method. What's new is that, if only the inbound value stops incrementing, it starts a process to determine whether inbound traffic has actually failed and whether a local switch is applicable, using the algorithm described in "How inbound failure detection works".

The feature applies to Ethernet, token ring, FDDI, and all other types of LAN cards that Serviceguard supports for local switches; more specifically, to the types of LAN cards that support the Data Link Provider Interface (DLPI).
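As an illustration of the two permitted values, the following Python sketch reads NETWORK_FAILURE_DETECTION from the text of a configuration file and rejects anything other than INOUT or INONLY_OR_INOUT. The file layout it expects (one "NAME VALUE" pair per line, "#" for comments), the `read_failure_detection` helper, and the fallback to INOUT when the parameter is absent are assumptions made for the example, not a description of the actual file syntax or of any Serviceguard tool.

```python
VALID_METHODS = ("INOUT", "INONLY_OR_INOUT")


def read_failure_detection(config_text: str, default: str = "INOUT") -> str:
    """Return the NETWORK_FAILURE_DETECTION value found in `config_text`.

    Assumes one "NAME VALUE" pair per line and "#" comments; this layout is
    an assumption for illustration only.
    """
    method = default
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if fields[0] == "NETWORK_FAILURE_DETECTION":
            if len(fields) != 2 or fields[1] not in VALID_METHODS:
                raise ValueError(f"invalid NETWORK_FAILURE_DETECTION line: {raw_line!r}")
            method = fields[1]
    return method


# Hypothetical configuration text used only to exercise the helper.
sample = "CLUSTER_NAME demo   # hypothetical\nNETWORK_FAILURE_DETECTION INONLY_OR_INOUT\n"
print(read_failure_detection(sample))
```

Because the parameter is global, a single value is returned for the whole cluster rather than one per LAN card.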

How inbound failure detection works


Broken cascaded cable scenario and how it would be handled

Let's look at an example of a two-node cluster in which each node is configured with two LAN cards for both heartbeat and data traffic. The primary LAN cards are connected to a primary switch, the standby LAN cards are connected to a standby switch, and the two switches are cascaded by a crossover cable. These switches connect to routers, then to clients.

Figure 2. Polling paths in a 2-node cluster's primary LAN data and heartbeat connection

The primary and standby LAN cards, lan0 and lan1, send poll packets to each other via the cascading cable (see Figure 2). Assume that the cascading cable gets disconnected. Now each LAN card can send but cannot receive any poll packets, and there is no other incoming traffic to increment the inbound statistics at the time. With the default method, INOUT, Serviceguard would not detect the failure in this situation. Routers would direct clients to LAN cards that are connected, giving them access to applications that are running. However, if the design were changed to fail over immediately when this happens, both primary and standby LAN cards would be marked as failed. The application would halt and network clients would experience downtime. If this is the only heartbeat network, both nodes would experience a Transfer of Control (TOC; more information on this is available in the Serviceguard manual).

Therefore, Serviceguard Network Manager's polling mechanism is enhanced to determine whether or not to fail over, helping to avoid this kind of LAN card failure and similar problems. The mechanism, which is called full polling, sends poll packets to all network interfaces on the same bridged net in the cluster and waits to see if there is any reply. Full polling is described in more detail in "Algorithm for inbound failure detection method".

Figure 3. An illustration of what happens when cascaded cable is broken if the new enhanced method of failure detection is applied

Algorithm for inbound failure detection method

With the enhanced network failure detection method, Serviceguard Network Manager monitors the status of LAN cards by sending polling messages, just as it does with the existing default method. This polling mechanism generates reliable traffic, which Serviceguard uses to keep track of inbound and outbound statistics. If the statistics stop incrementing for a defined period of time, Network Manager determines what actions to take next.

If the user sets NETWORK_FAILURE_DETECTION to INOUT, the existing method is used. With inbound-only failures, the Network Manager does not mark the LAN card as down and it does not begin a failover.

If the user sets NETWORK_FAILURE_DETECTION to INONLY_OR_INOUT, the new enhanced method is used. When Network Manager detects that the inbound traffic of a LAN card has stopped, it does the following:
1. The Network Manager waits for a period of time, based on the LAN card's type.
2. After that time has passed, the Network Manager starts polling from the LAN card to all other cards on the same bridged net in the cluster.
3. If the LAN card gets a reply back from any peer LAN card, the card and the connectivity between the card and the switch are healthy and there's no need for a failover. This can happen in the broken cascaded cable example used previously.
4. At any time, if Network Manager sees that the LAN card is again receiving messages from the LAN card it had lost communication with, it determines that connectivity has resumed and discontinues the full polling.
5. If full polling does not increment the inbound statistics for the LAN card for a period of time, the LAN card is not able to communicate with the rest of the network. Several possible failures could lead to this problem, but most can be solved by a LAN failover. Network Manager will make the assumption that a local failover is needed. It will mark the card DOWN and start the local switch procedure. If there is no standby LAN, the packaged application can fail over to another node if the affected subnet is configured in its package's SUBNET parameter and if the other node is configured to run it.
6. The LAN card will be marked UP when it can both send and receive messages again consistently.

Requiring consistent traffic before marking the card UP guards against frequent intermittent failures in a short time. However, if failures happen from time to time (for example, if several minutes pass between each failure), Serviceguard will fail the connection over back and forth, since there is no way to tell whether the problem is transient or not. This is the rule with both INOUT and INONLY_OR_INOUT. For INONLY_OR_INOUT, it applies whether the card failed due to inbound traffic only or due to both inbound and outbound traffic. If the card is the primary LAN card, the switchback procedure is started once the card is back up.

The following table summarizes Network Manager's behavior for each type of LAN card failure. "Yes" means the card is marked down in the particular situation.
Method             Inbound fails   Outbound fails             Both fail
INOUT              No              Yes (upon driver notice)   Yes
INONLY_OR_INOUT    Yes             Yes (upon driver notice)   Yes
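The decision logic in the numbered steps and the summary table can be sketched as a small Python function. This is an illustrative reading of the text, not Serviceguard code: the `Action` enum, the `next_action` function, and its boolean inputs are hypothetical, and the waiting periods in steps 1 and 5 are reduced to the `full_poll_done` flag because the actual timings are not given here.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"            # card looks healthy; nothing to do
    FULL_POLL = "full_poll"  # inbound stalled: poll all peers on the bridged net
    MARK_DOWN = "mark_down"  # declare the card DOWN and start a local switch


def next_action(method: str, inbound_stalled: bool, outbound_stalled: bool,
                driver_error: bool = False, full_poll_done: bool = False,
                peer_replied: bool = False) -> Action:
    """Choose the next step for one LAN card (hypothetical helper)."""
    if method not in ("INOUT", "INONLY_OR_INOUT"):
        raise ValueError(f"unknown method: {method}")
    if driver_error:
        return Action.MARK_DOWN          # driver notification acts immediately
    if not inbound_stalled:
        return Action.NONE               # also covers step 4: inbound resumed
    if outbound_stalled:
        return Action.MARK_DOWN          # both directions stalled, either method
    if method == "INOUT":
        return Action.NONE               # default method ignores inbound-only stalls
    if not full_poll_done:
        return Action.FULL_POLL          # steps 1-2: wait, then poll every peer
    # Step 3: any reply means the card and its switch link are healthy.
    # Step 5: no reply means a local failover is assumed to be needed.
    return Action.NONE if peer_replied else Action.MARK_DOWN
```

In the broken cascaded cable scenario, a card under INONLY_OR_INOUT would first get `Action.FULL_POLL`; once a peer on the same switch replies, the result is `Action.NONE` and no failover takes place.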

Risks

There are cases in which this enhancement can actually make the situation worse by failing over when the inbound traffic has stopped, especially when the network configuration is not sufficiently highly available.

Examples of network configurations with risks

All of the following examples assume that they happen at a time when the only traffic being generated and contributing to the statistics is Serviceguard polling messages. Also, all subnets in the cluster are monitored in the packages.

Example 1: single point of failure created with INONLY_OR_INOUT

This is a real-life example from a customer's environment. Enabling the new enhanced network failure detection functionality in this environment might create single points of failure that would not be there if the existing default method were applied. In Figure 4, with the new scheme applied, since switch C is not connected to the network via another path, switch E becomes a single point of failure. If it fails, the whole subnet will be marked as down, and clients will not be able to access the applications running on the two nodes. This does not happen without the new feature. In fact, with the existing behavior, none of the cards will be marked as down and, although they cannot access node B, clients will still be able to access applications on node A.

Figure 4. Single point of failure

Example 2: another case of broken cascaded cable

For this example, consider a two-node cluster with only one LAN card on each node, or only one connection for data traffic and no standby. Each LAN card connects to a switch and the two switches are cascaded. These switches connect to routers, then to the client LAN. Assume the cascading cable breaks, so each LAN card can send poll messages to the other but cannot receive any. With INOUT set, Serviceguard will not mark the LAN cards as down, the applications will keep running, and clients will still have access to the applications. However, if INONLY_OR_INOUT is applied, Serviceguard will mark the whole subnet as down and fail the packages that have the subnet monitored. See Figure 5 for an illustration of this example.

Figure 5. No standby LAN cards, cascaded cable broken

Example 3: two nodes with a primary connection and no standby

For this example, consider a two-node cluster connected with a primary connection, no standby, and only one switch. Since there is no standby, the two LAN cards poll each other remotely. If one card fails, for example lan0 on node B, and there is no traffic other than the remote polling traffic that just stopped, the inbound statistics of lan0 on node A will stop incrementing. Once Serviceguard's Network Manager on node A detects this, it will declare that the card is bad and then mark the whole subnet as down. The customer will experience application downtime. See Figure 6 for an illustration of this example.

Figure 6. No standby LAN cards, single switch, LAN card fails

Guidelines and recommendation

As long as their network configuration is highly available, users should be able to avoid these or similar risky situations when applying the new enhancement. The following guidelines help determine what can be done to have a highly available network configuration.

Test your cluster thoroughly before and after setting the parameter to INONLY_OR_INOUT. Run the application on the cluster to be sure the cluster network configuration is suitable for this option.

The network environment where the enhanced failure detection method is applied has considerable impact on Serviceguard so, before applying the new feature, be sure that the following conditions are met:

- All local bridged nets in each node in the cluster have at least two interfaces each
- Each primary interface has at least one standby interface that is connected to a switch other than the one the primary interface is connected to
- The primary switch is directly connected to its standby
- The standby switch is connected to a router through a different path than the primary switch
- There is no single point of failure anywhere in all of the bridged nets

When the new enhanced method of failure detection is applied in an appropriate network environment, users can avoid application hangs caused by inbound-only failures when LAN cards or switches fail, without the risks described above.
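The last condition, that no single device is a point of failure, can be checked on paper or with a small script before the parameter is changed. The Python sketch below models the network as a list of undirected links and reports every candidate device whose loss cuts a node off from the router. The device names, the `reachable` and `single_points_of_failure` helpers, and the choice to model each node as one vertex linked to the switches its LAN cards use are all assumptions for illustration; this is not a Serviceguard tool.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search over undirected `links`, skipping the `removed` device."""
    if removed in (start, goal):
        return False
    adjacency = {}
    for a, b in links:
        if removed in (a, b):
            continue  # a failed device takes all of its links with it
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        device = queue.popleft()
        if device == goal:
            return True
        for neighbor in adjacency.get(device, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, node, router, candidates):
    """Return the candidate devices whose loss cuts `node` off from `router`."""
    return [d for d in candidates if not reachable(links, node, router, removed=d)]


# Hypothetical topology: both switches have their own path to the router.
redundant = [("nodeA", "sw1"), ("nodeA", "sw2"), ("sw1", "sw2"),
             ("sw1", "router"), ("sw2", "router")]
# Same topology, but only the primary switch reaches the router.
fragile = [("nodeA", "sw1"), ("nodeA", "sw2"), ("sw1", "sw2"), ("sw1", "router")]

print(single_points_of_failure(redundant, "nodeA", "router", ["sw1", "sw2"]))
print(single_points_of_failure(fragile, "nodeA", "router", ["sw1", "sw2"]))
```

In the second topology the standby switch has no router path of its own, so the primary switch is reported; that is the kind of configuration in which the guidelines advise against enabling INONLY_OR_INOUT.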

For more information


To learn more about HP high-availability products and solutions, please visit: www.hp.com/go/serviceguard

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

5982-6590EN, 06/2004
