White paper
Introduction
Serviceguard Network Manager default network failure detection mechanism
Driver notification
Traffic statistics and polling mechanism
Inbound failure detection enhancement
Why the enhancement was made
Configuration of network failure detection
How inbound failure detection works
Broken cascaded cable scenario and how it would be handled
Algorithm for inbound failure detection method
Risks
Examples of network configurations with risks
Guidelines and recommendation
For more information
Introduction
This white paper briefly describes the default behavior of the Serviceguard network failure detection mechanism. It also describes in detail the enhancement to that mechanism that deals with inbound-only failures, and explains why, how, and when the enhanced method should be applied.
Driver notification
Whenever a driver sends any error notification to indicate that a card has failed, Serviceguard immediately declares that the card is bad and performs a failover, if applicable. For example, when a card fails to send due to a severed link, Serviceguard will immediately declare the card down on receiving the error notification.
The default method has been the only method of network failure detection in Serviceguard prior to version A.11.16. With this default method, Serviceguard will mark the card as failed if both inbound and outbound statistics of a LAN card stop incrementing for a specific amount of time. With this default method, Serviceguard will not mark the card as bad if only inbound statistics stop incrementing or if only outbound statistics stop incrementing. However, if a LAN card fails to send and its outbound statistics stop increasing, Serviceguard will usually receive an error notification from the driver and mark the card as down right away.
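The default rule described above can be sketched as follows. This is a minimal illustration, not Serviceguard's implementation: the `LanStats` structure, the `inout_marks_failed` function, and the notion of a fixed number of sampling intervals are assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LanStats:
    """One sample of a LAN card's packet counters (hypothetical structure)."""
    inbound: int
    outbound: int


def inout_marks_failed(samples: Sequence[LanStats], stalled_intervals: int) -> bool:
    """Return True only if BOTH counters stayed flat over the last
    `stalled_intervals` sampling intervals (the default INOUT rule)."""
    if stalled_intervals < 1 or len(samples) < stalled_intervals + 1:
        return False  # not enough history to decide
    window = samples[-(stalled_intervals + 1):]
    first = window[0]
    inbound_flat = all(s.inbound == first.inbound for s in window)
    outbound_flat = all(s.outbound == first.outbound for s in window)
    # An inbound-only or outbound-only stall does not mark the card as failed.
    return inbound_flat and outbound_flat
```

With this sketch, a card whose inbound counter is frozen while its outbound counter keeps growing is not reported as failed, which is exactly the gap the inbound-only enhancement addresses.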
Figure 2. Polling paths in a 2-node cluster's primary LAN data and heartbeat connection
The primary and standby LAN cards, lan0 and lan1, send poll packets to each other via the cascading cable (see Figure 2). Assume that the cascading cable gets disconnected. Now each LAN card can send but cannot receive any poll packets, and there is no other incoming traffic that can increment the inbound statistics at the time. With the default method, INOUT, Serviceguard would not detect the failure in this situation. Routers would direct clients to LAN cards that are connected, giving them access to applications that are running. However, if the design were changed to fail over immediately when this happens, both the primary and the standby LAN cards would fail. The application would halt and network clients would experience downtime. If this is the only heartbeat network, both nodes would experience a Transfer of Control (TOC; more information on this is available in the Serviceguard manual). Therefore, Serviceguard Network Manager's polling mechanism is enhanced to determine whether or not to fail over, helping to avoid such LAN card failures and similar problems. The mechanism, which is called full polling, sends poll packets to all network interfaces on the same bridged net in the cluster and waits to see if there is any reply. Full polling is described in more detail in "Algorithm for inbound failure detection method."
Figure 3. An illustration of what happens when the cascaded cable is broken and the enhanced method of failure detection is applied
Algorithm for inbound failure detection method
With the enhanced network failure detection method, Serviceguard Network Manager monitors the status of LAN cards by sending polling messages, just as it does with the existing default method. This polling mechanism generates reliable traffic, which Serviceguard uses to keep track of inbound and outbound statistics. If the statistics stop incrementing for a defined period of time, Network Manager determines what actions to take next.
If the user sets NETWORK_FAILURE_DETECTION to INOUT, the existing method is used. With inbound-only failures, the Network Manager does not mark the LAN card as down and it does not begin a failover.
If the user sets NETWORK_FAILURE_DETECTION to INONLY_OR_INOUT, the new enhanced method is used. When Network Manager detects that the inbound traffic of a LAN card has stopped, it does the following:
1. The Network Manager waits for a period of time, based on the LAN card's type.
2. After that time has passed, the Network Manager starts polling from the LAN card to all other
card and switch is healthy and there's no need for a failover. This can happen in the broken cascaded cable example used previously.
4. At any time, if Network Manager sees that the LAN card is again receiving messages from the
LAN card it had lost communication with, it determines that connectivity has resumed and discontinues the full polling.
5. If full polling does not increment the inbound statistics for the LAN card for a period of time, the
LAN card is not able to communicate with the rest of the network. Several possible failures could lead to this problem, but most can be solved by a LAN failover. Network Manager will make the assumption that a local failover is needed. It will mark the card DOWN and start the local switch procedure. If there is no standby LAN, the packaged application can fail over to another node if the affected subnet is configured in its package's SUBNET parameter and if the other node is configured to run it.
6. The LAN card will be marked UP when it can both send and receive messages again consistently.
This guards against frequent intermittent failures in a short time. However, if failures happen from time to time (for example, if several minutes pass between each failure), Serviceguard will fail the connection over back and forth, since there is no way to tell if the problem is transient or not. This is the rule with both INOUT and INONLY_OR_INOUT. For INONLY_OR_INOUT, it applies whether the card failed due to inbound traffic only or both inbound and outbound traffic. The switchback procedure is started once the card is back up if it is the primary LAN card.
The following table summarizes Network Manager's behavior for each type of LAN card failure. "Yes" means the card is marked down in the particular situation.
                   Inbound fails   Outbound fails             Both fail
INOUT              No              Yes (upon driver notice)   Yes
INONLY_OR_INOUT    Yes             Yes (upon driver notice)   Yes
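The table above, together with the full polling step, can be condensed into a small decision function. This is a hedged sketch only: the `Mode` enum, the `card_marked_down` function, and its boolean parameters are hypothetical names chosen for illustration, and timing (the wait periods in steps 1 and 5) is deliberately left out.

```python
from enum import Enum


class Mode(Enum):
    """Values of NETWORK_FAILURE_DETECTION named in this paper."""
    INOUT = "INOUT"
    INONLY_OR_INOUT = "INONLY_OR_INOUT"


def card_marked_down(mode: Mode,
                     inbound_stalled: bool,
                     outbound_stalled: bool,
                     driver_notified: bool = False,
                     full_poll_reply: bool = False) -> bool:
    """Decide whether a LAN card is marked DOWN (illustrative only)."""
    if driver_notified:
        # A driver error notification marks the card down immediately.
        return True
    if inbound_stalled and outbound_stalled:
        # Both counters stalled: marked down under either setting.
        return True
    if inbound_stalled and mode is Mode.INONLY_OR_INOUT:
        # Inbound-only stall: full polling decides. Any reply means the
        # card and switch are healthy, so no failover is needed.
        return not full_poll_reply
    # Inbound-only stall under INOUT, or an outbound-only stall without a
    # driver notice, does not mark the card down.
    return False
```

For example, `card_marked_down(Mode.INONLY_OR_INOUT, True, False, full_poll_reply=True)` models the broken cascaded cable case where full polling still gets an answer, so the card stays up.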
Risks
There are cases in which this enhancement can actually make the situation worse by failing over when the inbound traffic has stopped, especially when the network configuration is not sufficiently highly available.
Examples of network configurations with risks
In all of the following examples, the assumption is that they happen at a time when the only traffic being generated and contributing to the statistics is Serviceguard polling messages. Also, all subnets in the cluster are monitored in the packages.
Example 1: single point of failure created with INONLY_OR_INOUT
This is a real-life example from a customer's environment. Enabling the enhanced network failure detection functionality in this environment might create single points of failure that would not be there if the existing default method were applied. In Figure 4, with the new scheme applied, since switch C is not connected to the network via another path, switch E becomes a single point of failure. If it fails, the whole subnet will be marked as down, and clients will not be able to access the applications running on the two nodes. This does not happen without the new feature. In fact, with the existing behavior, none of the cards will be marked as down and, although they cannot access node B, clients will still be able to access applications on node A.
Example 2: another case of broken cascaded cable
For this example, consider a two-node cluster with only one LAN card on each node, or only one connection for data traffic and no standby. Each LAN card connects to a switch and the two switches are cascaded. These switches connect to routers, then to the client LAN. Assume the cascading cable breaks, so each LAN card can send poll messages to the other but cannot receive them. With INOUT set, Serviceguard will not mark the LAN cards as down, the applications will keep running, and clients will still have access to the applications. However, if INONLY_OR_INOUT is applied, Serviceguard will mark the whole subnet as down and fail the packages that have the subnet monitored. See Figure 5 for an illustration of this example.
Example 3: two nodes with a primary connection and no standby
For this example, consider a two-node cluster connected with a primary connection, no standby, and only one switch. Since there is no standby, the two LAN cards poll each other remotely. If one card fails, for example lan0 on node B, and there is no traffic other than the remote polling traffic that just stopped, the inbound statistics of lan0 on node A will stop incrementing. Once Serviceguard's Network Manager on node A realizes this, it will declare that the card is bad and then mark the whole subnet as down. The customer will experience application downtime. See Figure 6 for an illustration of this example.
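Examples 2 and 3 share one pattern: a card loses inbound polling traffic, nothing answers full polling, and there is no standby to switch to. The sketch below models only that outcome on a single node. The function name, its parameters, and the returned strings are hypothetical and exist purely to make the reasoning explicit.

```python
def inbound_loss_outcome(mode: str, peers_replying: int, has_standby: bool) -> str:
    """Outcome on one node after a card's inbound polling traffic stops
    while the card can still send (illustrative model only)."""
    if mode not in ("INOUT", "INONLY_OR_INOUT"):
        raise ValueError(f"unknown NETWORK_FAILURE_DETECTION value: {mode}")
    if mode == "INOUT" or peers_replying > 0:
        # Default method ignores inbound-only loss; under the enhanced
        # method, any full polling reply shows the card is healthy.
        return "card stays UP"
    if has_standby:
        return "local switch to standby"
    # No reply and no standby: the whole subnet is marked down and
    # packages monitoring it are affected.
    return "subnet marked DOWN"
```

In Example 3, node A has no standby and its only polling peer has failed, so `inbound_loss_outcome("INONLY_OR_INOUT", 0, False)` yields the subnet-down result, whereas the same situation under `"INOUT"` leaves the card up.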
Guidelines and recommendation
As long as their network configuration is highly available, users should be able to avoid these or similar risky situations when applying the new enhancement. The following guidelines help determine what can be done to have a highly available network configuration.
Test your cluster thoroughly before and after setting the parameter to INONLY_OR_INOUT. Run the application on the cluster to be sure the cluster network configuration is suitable for this option.
The network environment where the enhanced failure detection method is applied has considerable impact on Serviceguard, so before applying the new feature, be sure that the following conditions are met:
- All local bridged nets in each node in the cluster have at least two interfaces each
- Each primary interface has at least one standby interface that is connected to a switch other than the one the primary interface is connected to
- The primary switch is directly connected to its standby
- The standby switch is connected to a router through a different path than the primary switch
- There is no single point of failure anywhere in all of the bridged nets
When the enhanced method of failure detection is applied in an appropriate network environment, users can avoid application hangs caused by inbound-only failures of LAN cards or switches, without the risks described above.
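The first three conditions lend themselves to a simple pre-flight check before switching to INONLY_OR_INOUT. The following is a rough sketch under stated assumptions: the `Interface` record, the switch labels, and the `check_bridged_net` function are all hypothetical, and the router-path and general single-point-of-failure conditions are not covered because they require a full topology model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class Interface:
    name: str    # e.g. "lan0"
    role: str    # "primary" or "standby"
    switch: str  # hypothetical label of the switch the card plugs into


def check_bridged_net(interfaces: Sequence[Interface],
                      switch_links: Set[FrozenSet[str]]) -> List[str]:
    """Return a list of guideline violations for one local bridged net.

    `switch_links` holds unordered pairs of directly connected switches.
    """
    problems: List[str] = []
    if len(interfaces) < 2:
        problems.append("bridged net has fewer than two interfaces")
    standbys = [i for i in interfaces if i.role == "standby"]
    for primary in (i for i in interfaces if i.role == "primary"):
        # Standbys that sit on a different switch than this primary.
        remote = [s for s in standbys if s.switch != primary.switch]
        if not remote:
            problems.append(f"{primary.name}: no standby on a different switch")
        elif not any(frozenset((primary.switch, s.switch)) in switch_links
                     for s in remote):
            problems.append(
                f"{primary.name}: primary switch not directly connected to standby switch")
    return problems
```

An empty result only means these three checks passed; it does not replace the cluster testing recommended above.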
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. 5982-6590EN, 06/2004