
HP-UX 11i TCP/IP Performance White Paper

1 Introduction
  1.1 Intended Audience
  1.2 Organization of the document
  1.3 Related Documents
  1.4 Acknowledgements
2 Out of the Box TCP/IP Performance Features for HP-UX Servers
  2.1 TCP Window Size and Window Scale Option (RFC 1323)
  2.2 Selective Acknowledgement (RFC 2018)
  2.3 Limited Transmit (RFC 3042)
  2.4 Large Initial Congestion Window (RFC 3390)
  2.5 TCP Segmentation Offload (TSO)
  2.6 Packet Trains for IP fragments
3 Advanced Out of the Box Scalability and Performance Features
  3.1 TOPS
    3.1.1 Configuration Scenario for TOPS
    3.1.2 socket_enable_tops Tunable
  3.2 STREAMS NOSYNC Level Synchronization
    3.2.1 IP NOSYNC synchronization
  3.3 Protection from Packet Storms
    3.3.1 Detect and Strobe Solution
    3.3.2 HP-UX Networking Responsiveness Features
    3.3.3 Responsiveness Tuning
  3.4 Interrupt Binding and Migration
    3.4.1 Configuration Scenario for Interrupt Migration
    3.4.2 Cache Affinity Improvement
4 Improving HP-UX Server Performance
  4.1 Tuning Application and Database Servers
    4.1.1 Tuning Application Servers
    4.1.2 Tuning Database Servers
  4.2 Tuning Web Servers
    4.2.1 Network Server Accelerator HTTP
    4.2.2 Socket Caching for TCP Connections
    4.2.3 Tuning tcphashsz
    4.2.4 Tuning the listen queue limit
    4.2.5 Using MSG_EOF flag for TCP Applications
  4.3 Tuning Servers in Wireless Networks
    4.3.1 Smoothed RTO Algorithm
    4.3.2 Forward-Retransmission Timeout (F-RTO)
5 Tuning Applications Using Programmatic Interfaces
  5.1 sendfile()
  5.2 Polling Events
  5.3 send() and recv() Socket Buffers
    5.3.1 Data Buffering in Sockets
    5.3.2 Controlling Socket Buffer Limits
    5.3.3 System Socket Buffer Tunables
  5.4 Effective use of the listen backlog value
6 Monitoring Network Performance
  6.1 Monitoring network statistics
    6.1.1 Monitoring TCP connections with netstat -an
    6.1.2 Monitoring protocol statistics with netstat -p
    6.1.3 Monitoring link level statistics with lanadmin
  6.2 Monitoring System Resource Utilization
    6.2.1 Monitoring CPU Utilization Using Glance
    6.2.2 Monitoring CPU statistics using Caliper
    6.2.3 Monitoring Memory Utilization using Glance
    6.2.4 Monitoring Memory utilization using vmstat
    6.2.5 Monitoring Cache Miss Latency
    6.2.6 Monitoring other resources
  6.3 Measuring Network Throughput
    6.3.1 Measuring Throughput with Netperf Bulk Data transfer
    6.3.2 Measuring Transaction Rate with Netperf request/response
    6.3.3 Key issues for throughput with Netperf traffic
  6.4 Additional Monitoring Tools
Appendix A: Annotated output of netstat -s (TCP, UDP, IP, ICMP)
Appendix B: Annotated output of ndd -h and discussions of the TCP/IP tunables
Table 1: Summary of TCP/IP Tunables
Table 2: Operating System Support for TCP/IP Tunables
Revision History

Introduction

This white paper is intended as a guide to tuning networking performance at the network and transport layers, which include IPv4, IPv6, TCP, UDP, and related protocols. Some topics touch on other areas, including socket interfaces, network interface drivers, and application protocols; however, those are not the focus of this paper. Other information is available for these subsystems, as referenced below.

1.1 Intended Audience


This whitepaper is intended for the following:
- Administrators responsible for supporting or tuning the internal workings of the HP-UX networking stack
- Network programmers, for example those who directly write to the TCP or UDP protocols using socket system calls
- HP-UX network and system administrators who want to supplement their knowledge of HP-UX configuration options

NOTE: This white paper is specific to performance tuning, and is not a general guide to HP-UX network administration.

1.2 Organization of the document


This document is organized as follows:
- Chapter 1: Introduction.
- Chapter 2: Provides information on out of the box TCP/IP performance features.
- Chapter 3: Provides information on advanced out of the box scalability and performance features.
- Chapter 4: Provides recommendations on tuning HP-UX servers.
- Chapter 5: Provides information on tuning applications using programmatic interfaces.
- Chapter 6: Provides information on how to monitor and troubleshoot network performance on HP-UX.
- Appendix A: Provides a detailed description of netstat statistics and related tuning.
- Appendix B: Provides a detailed description of TCP/IP ndd tunables.

1.3 Related Documents


The following documentation supplements information in this document:
- HP-UX 11i v3 performance further increases your productivity for improved IT business value
  http://h71028.www7.hp.com/ERC/downloads/4AA1-0961ENW.pdf
- Performance and Recommended Use of AB287A 10 Gigabit Ethernet Cards
  http://docs.hp.com/en/10gigEwhitepaper.pdf/10Gige_arches_whitepaper_version5.pdf
- HP Auto Port Aggregation Performance and Scalability White Paper
  http://docs.hp.com/en/7662/new-apa-white-paper.pdf
- Network Server Accelerator White Paper
  http://www.docs.hp.com/en/NSAWP-90902/NSAWP-90902.pdf
- RFCs related to TCP/IP Performance: http://www.ietf.org/rfc.html
  - RFC 1323: TCP Extensions for High Performance
  - RFC 2001: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms
  - RFC 2018: TCP Selective Acknowledgement Options
  - RFC 2861: TCP Congestion Window Validation
  - RFC 3042: Enhancing TCP's Loss Recovery Using Limited Transmit
  - RFC 3390: Increasing TCP's Initial Window
  - RFC 3782: The NewReno Modification to TCP's Fast Recovery Algorithm
  - RFC 4138: Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts

1.4 Acknowledgements:
Most of the information in Appendix A and Appendix B has been derived from the Annotated Output of 'ndd -h' and 'netstat -s' documents written by Rick Jones. You can find these documents at:
ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_ndd.txt
ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_netstat.txt

Out of the Box TCP/IP Performance Features for HP-UX Servers

The HP-UX Networking Stack is specially engineered and tested for optimum performance in an enterprise mission-critical environment. HP-UX 11i v3 exhibits excellent performance as an NFS server and in the TPC-C benchmark, a measurement of intensive online transaction processing (OLTP) in a database environment. Typically, OLTP includes a mixture of read-only or update, short or long, and interactive or deferred database transactions. There are significant networking performance tuning improvements and optimizations in 11i v3 for database applications, as demonstrated by the benchmark result. Many out of the box performance features were introduced in HP-UX 11i. Users do not need to configure or tune any attributes in order to see the performance improvement from these features. The networking stack gracefully adapts to different networking needs in an enterprise, from noisy low-bandwidth wireless environments to high-bandwidth, high-throughput datacenter environments. The TCP/IP performance features described in this chapter improve the performance of HP-UX servers, including database servers, application servers, NFS servers, web servers, mail servers, DNS servers, ftp servers, DHCP servers, gateways, and firewall systems.

2.1 TCP Window Size and Window Scale Option (RFC 1323)
TCP performance depends not only on the transfer rate itself, but also on the product of the link bit rate and the round-trip delay, or latency. This "bandwidth-delay product" measures the amount of data that would "fill the pipe"; it is the buffer space required on the sender and receiver systems to obtain maximum throughput on the TCP connection over the path, i.e., the amount of unacknowledged data that TCP must handle in order to keep the pipeline full. TCP performance problems arise when the bandwidth-delay product is large. We refer to an Internet path operating in this region as a "long, fat pipe". In order to improve the performance of a network with a large bandwidth-delay product, the TCP window size needs to be sufficiently large. HP-UX supports the TCP window scale option (RFC 1323), which increases the maximum TCP window size up to approximately 1 gigabyte, or 1,073,725,440 bytes (65,535 * 2^14). When HP-UX initiates a TCP connection (an active open), it always sends a SYN segment with the window scale option. Even when the real window size is less than 65,536, the window scale option is used with the scale factor set to 0. This is because advertising a 64K window with a window scale option of 0 is better than advertising a 64K window without a window scale option, as it tells the peer that the window scale option is supported. When HP-UX responds to a connection request (a passive open), it accepts SYN segments with the window scale option. For the receiving TCP, the default receive window size is set by the ndd tunable tcp_recv_hiwater_def. Applications can change the receive window size with the SO_RCVBUF socket option. To fully utilize the bandwidth, the receive window needs to be sufficiently large for a given bandwidth-delay product. For the sending TCP, the default send buffer size is set by the ndd tunable tcp_xmit_hiwater_def, and applications can change the size with the SO_SNDBUF setsockopt() option.

By setting the send socket buffer sufficiently large for a given bandwidth-delay product, the transport is better positioned to take full advantage of the remote TCP's advertised window.
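For example, a 1 Gbit/s path with a 50 ms round-trip time has a bandwidth-delay product of roughly 125,000,000 bytes/s * 0.05 s = 6,250,000 bytes (about 6 MB), so socket buffers of at least that size are needed to keep the pipe full. As an illustration only (the values shown are not HP recommendations), the system-wide defaults named above could be raised with ndd:

# ndd -set /dev/tcp tcp_recv_hiwater_def 6553600
# ndd -set /dev/tcp tcp_xmit_hiwater_def 6553600

Alternatively, an individual application can request larger buffers for just its own connections using the SO_RCVBUF and SO_SNDBUF socket options described above.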

2.2 Selective Acknowledgement (RFC 2018)


TCP may experience poor performance when multiple packets are lost from one window of data. Selective Acknowledgment (SACK), described in RFC 2018, is effective in recovering from loss of multiple segments in a window. It accomplishes this by extending TCP's original, simple "ACK to the first hole in the data" algorithm with one that describes holes past the first lost segment. This information, sent from the receiver to the sender as an option field in the TCP header, allows the sender to retransmit lost segments sooner. In addition, the acknowledgment of segments after the first hole in sequence space using SACK allows the sender to avoid retransmission of segments which were not lost. SACK is configured in HP-UX with the ndd tunable tcp_sack_enable, which can be set to the following values:

0: Never initiate, nor accept the use of SACK
1: Always ask for, and accept the use of SACK
2: Do not ask for, but accept the use of SACK (Default)

The default value of 2 is somewhat conservative, as the system will not initiate the use of SACK on a connection. It may be necessary to keep this value in some cases, as other TCP implementations which do not support SACK may be improperly implemented and may not ignore this option when it is requested in a TCP SYN segment. However, if the remote initiates the connection and asks for SACK, HP-UX will honor that request. A tcp_sack_enable value of 1 should be used if you want the system to use SACK for those connections initiated from the system itself (i.e. applications on the system calling connect() or t_connect()).
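For example, to have the system request SACK on connections it initiates, as discussed above (shown as an illustration; you may want to verify the current value first):

# ndd -get /dev/tcp tcp_sack_enable
# ndd -set /dev/tcp tcp_sack_enable 1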

2.3 Limited Transmit (RFC 3042)


HP-UX implements TCP Limited Transmit (RFC 3042), which provides faster recovery from packet loss. When a segment is lost, exercising the TCP Fast Retransmit algorithm (RFC 2001) is much faster than waiting for the TCP retransmit timeout. In order to trigger TCP Fast Retransmit, three duplicate acknowledgments need to be received. However, when the congestion window is small, enough duplicate acknowledgments may not be generated. The Limited Transmit feature attempts to induce the necessary duplicate acknowledgments in such situations. For each of the first two duplicate ACKs, Limited Transmit sends a new data segment (if new data is available). If a previous segment has in fact been lost, these new segments will induce additional duplicate acknowledgments, improving the chances of triggering Fast Retransmit. Limited Transmit can be used either with or without the TCP selective acknowledgement (SACK) mechanism.

2.4 Large Initial Congestion Window (RFC 3390)


The congestion window is the flow-control imposed by the sending TCP entity. When TCP starts a new connection or re-starts transmission after a long idle period, it starts conservatively by sending a few segments, i.e. the initial congestion window, and does not utilize the whole window advertised by the receiving TCP.

The large initial congestion window (RFC 3390) increases the permitted initial window from one or two segments to four segments or 4380 bytes, whichever is less. For example, when the MSS is 1460 bytes, the TCP connection starts with three segments (3 * 1460 = 4380). By default, HP-UX uses the large initial congestion window; this is configured by the ndd tunable tcp_cwnd_initial. The large initial congestion window is especially effective for connections that need to send a small quantity of data. For example, sending 4 KB of data takes just one round-trip time (RTT); without the large initial window, an extra RTT is required, which can have a significant performance impact on long-delay networks.
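The current setting can be inspected with ndd (shown for illustration; the default already enables the larger initial window):

# ndd -get /dev/tcp tcp_cwnd_initial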

2.5 TCP Segmentation Offload (TSO)


TCP Segmentation Offload (TSO) refers to a mechanism by which the TCP host stack offloads certain portions of outbound TCP packet processing to the Network Interface Card (NIC). This reduces the host CPU utilization. It allows the HP-UX transport implementation to create packets up to 32120 bytes in length that can be passed down to the driver in one write. This feature is also referred to as Large Send Offload, Segmentation Offload, Multidata Transmit, or Re-segmentation. TSO increases the efficiency of the HP-UX kernel by allowing 22 segments with the TCP maximum segment size (MSS) of 1460 bytes to be processed at one time, saving 21 passes down the stack. Using a Jumbo Frame MTU of 9000 bytes, this translates to 3 to 4 passes for a 32 KB send.
This feature can significantly reduce the server load for applications that transmit large amounts of data from the system. Examples of such applications include Web Servers, NFS, and file transfer applications.

If the link is primarily used for bulk data transfers, turning on this feature improves CPU utilization. The performance gain is less significant for shorter application messages transmitted over interactive streams. The NIC must be capable of providing this feature. To enable this feature, use the following commands:

11i v1 and 11i v2:
# lanadmin -X send_cko_on ppa
# lanadmin -X vmtu 32160 ppa

11i v3:
# nwmgr -s -A tx_cko=on -c interface_name
# nwmgr -s -A vmtu=32160 -c interface_name

If the card is not TSO-capable, the "vmtu" option will not be supported. For more information on TSO-enhanced cards and drivers, go to http://www.hp.com/go/softwaredepot and search for TSO.

2.6 Packet Trains for IP fragments


Packet Trains are used when sending IP fragments. This solves the problem where a driver may not be able to handle a burst of IP fragments. Previously, when processing a large datagram, IP would fragment the datagram to the current MTU size and send down each fragment to the driver before processing the rest of the datagram. This could cause a problem if the driver was unable to process one or more of the IP fragments during outbound processing of these individual fragments.

A single fragment dropped by the driver will cause an entire datagram to be unrecoverable. When the remote machine picks up the remaining fragments, they will be queued in its reassembly queue, according to the IP protocol. If this happens frequently, the entire IP reassembly queue on the receiving side will be exhausted. This, in turn, would result in good packets being dropped because of the full buffer on the receiving side. To mitigate this problem, HP-UX uses Packet Trains. As each fragment is carved off, it is linked with the other fragments for this write to form a packet train, until the entire datagram is processed. Then a request is made to the driver to ensure that all of the fragments can be accommodated in one request. If so, IP passes down the packet train and the driver sends it to the card. If the driver cannot accommodate the entire packet train, then the entire train is discarded. This reduces the host CPU utilization. This feature is enabled by default, provided that the driver is capable of handling this request. Currently only 1000 Mbit or faster interfaces support this feature. To see if a driver has this feature enabled, enter the following command:

# ndd -get /dev/ip ip_ill_status

If the output includes the keyword TRAIN, the driver supports this feature. For example:

ILL              rq               wq               upcnt mxfrg err memavail ilmcnt name
00000000517990a8 000000005001e400 000000005001e580 00001 01500 000 00000000 000001 lan0
        RUNNING BROADCAST CKO MULTICAST CKO_IN TRAIN

Advanced Out of the Box Scalability and Performance Features

The HP-UX Networking Stack has been engineered for best scalability and performance on high-end servers. It can gracefully scale up from a few processors to 256 processors, and from 10Base-T to 10 Gigabit Ethernet. Due to the varying configuration requirements of different types of workloads on high-end servers, HP-UX provides the following advanced performance features for a highly scalable TCP/IP stack:
- TOPS
- NOSYNC
- Protection from Packet Storms
- Interrupt Binding

3.1 TOPS
Thread-Optimized Packet Scheduling (TOPS) increases the scalability and performance of TCP and UDP socket applications sharing a high-bandwidth network interface on multiprocessor systems. The goal is to move inbound packet processing to the same processor that runs the receiving user application. IP networking stacks, such as the stack implemented on HP-UX, operate as multiplexers, which route packets between network interface cards (NICs) and a set of user endpoints. HP-UX achieves excellent scalability by scheduling multiple applications across a set of processors, and, for outbound data, applications scale well when sharing a NIC. However, for inbound data, the configuration of each NIC determines which processor it interrupts. For most NICs, a single processor is interrupted as packets come in from the network. In the absence of TOPS, this processor will do the protocol processing for each incoming packet. Since a single high-speed NIC can process incoming data for many connections, the processor interrupted by this NIC can easily become a bottleneck. This prevents the maximum network throughput or packet rate from being realized. In order to improve scalability in this case, the TOPS mechanism allows the driver to quickly hand off packets to the processor where the application is most likely running, and return to processing packets coming from the wire. In most cases, a single processor will then perform all memory accesses to the application data inside each packet. This leads to a more efficient use of memory and cache subsystems. The TOPS mechanism is used by all TCP and UDP sockets without application modification or recompilation.

3.1.1 Configuration Scenario for TOPS

TOPS is most beneficial for system configurations where the number of CPUs is much greater than the number of NICs, such as a 16-way system with one or two Gigabit cards. Inbound packet processing is spread among the CPUs based on where the socket application processes are scheduled, leading to a more even distribution of the processing load in MP-scalable and network-intensive applications.

3.1.2 socket_enable_tops Tunable

TOPS is enabled by default on HP-UX 11i, and requires no action on the part of an application to take advantage of this feature. On the more recent patches of 11i v1 and 11i v2, the ndd tunable socket_enable_tops is available to turn off or alter the behavior of TOPS.

In 11i v3, the equivalent tunable will be provided in a future patch. This may be useful in the cases described below, where specific conditions make the TOPS default less than optimal. Refer to Table 2 (at the end of Appendix B) for the patch level information for the ndd tunable socket_enable_tops. It should not be necessary to disable TOPS. However, there are cases where the scalability issue addressed by TOPS does not exist. When there are multiple NICs on a system, it is possible that no NIC interrupt will become a processing bottleneck even with TOPS disabled (socket_enable_tops = 0). In these cases, there may be some efficiency gained by avoiding the overhead of TOPS, and allowing more of the processing to be done in the NIC interrupt context before switching to the processor running the application. In the most efficient, highest-performing case of the application and NIC being assigned to the same processor, however, there is no need for TOPS to switch processors, and therefore the TOPS tunable setting will have no effect on performance. Another consideration for TOPS tuning is whether the NIC is configured for checksum offload (CKO) on inbound data. If CKO is enabled, TOPS will provide less benefit for the memory cache, as there will not be a need to read the payload data during the inbound TCP/UDP processing. As an application is rescheduled over time between different processors, or in cases where threads executing on different processors share a socket, TOPS may not operate optimally in determining which processor to switch to in order to match where the system call will execute to receive the data. In most cases, the default TOPS setting for 11i v3 (socket_enable_tops = 2) will work best in following the application to its current CPU. In cases where sockets are being opened and closed at a high rate, it may be possible to gain some efficiency by fixing the processor assigned to each connection by TOPS, using the ndd setting socket_enable_tops = 1, which is the default for 11i v1 and 11i v2. However, these cases may be rare, and can only be determined by experimentation, or by detailed measurement and analysis of the performance of the HP-UX kernel. As a result, changing from the default setting to socket_enable_tops = 2 on 11i v1 and 11i v2 will provide equal or better performance in the majority of cases.
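For example, on a patch level where the tunable is available, the TOPS behavior could be inspected and changed with ndd as follows (an illustration only, assuming socket_enable_tops is managed through /dev/sockets like the other socket_* tunables in this paper):

# ndd -get /dev/sockets socket_enable_tops
# ndd -set /dev/sockets socket_enable_tops 2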

3.2 STREAMS NOSYNC Level Synchronization


Previously, the STREAMS framework supported execution of only one instance of the put procedure at a time for a given STREAMS queue. For multiple requests to the same queue, STREAMS synchronized the requests depending on the synchronization level of a module. Synchronization ensured that only one request was executed at a time. With high speed I/O, the synchronization limits imposed by STREAMS could easily lead to a performance bottleneck. The restriction imposed by these previous STREAMS synchronization methods has been removed by providing a new synchronization method, NOSYNC, in 11i v3 and the latest patches for 11i v1 and 11i v2. If a module uses NOSYNC level synchronization, the STREAMS framework can concurrently execute multiple instances of its queue's put procedure and a single instance of the same queue's service procedure. This requires the modules to protect any module-specific data that is shared between multiple instances of put procedures, or between the put and service procedures.

3.2.1 IP NOSYNC synchronization

With NOSYNC level synchronization, the IP module can handle requests simultaneously when multiple requests arrive on the same queue. This feature significantly improves network throughput, reaching near link speed for high-speed network interfaces such as multi-port Gigabit cards in an Auto Port Aggregation (APA) configuration or 10 Gigabit cards. To realize the performance gain from this feature, all modules (e.g., DLPI, IPFilter) on the networking stack between the IP layer and the LAN driver must have NOSYNC enabled.


HP recommends that providers of modules pushed on the DLPI stream create or modify the modules to operate at the NOSYNC synchronization level so that the NOSYNC performance gain is not lost. For more details about writing a NOSYNC module/driver, refer to the STREAMS/UX Programmer's Guide, available at http://docs.hp.com/en/netcom.html#STREAMS/UX

Patch level information for the NOSYNC feature:

11i v1:
- STREAMS: PHNE_35453 or higher
- ARPA Transport: PHNE_35351 or higher
- DLPI: PHNE_33704 or higher
- IPFilter: A.03.05.12 or later

11i v2:
- STREAMS: PHNE_34788 or higher
- ARPA Transport: PHNE_35765 or higher
- DLPI: PHNE_33429 or higher
- IPFilter: A.03.05.12 or later

3.3 Protection from Packet Storms


When the network is overloaded, or when defective network components send out a flood of packets, a server can see an inbound packet storm of network traffic. This can have a serious impact on the performance of a mission-critical server, as a large amount of CPU power is consumed and an excessive amount of time is spent in interrupt context processing these packets. HP-UX has an extensive set of features to minimize the negative effects of many types of packet storms. Using the default capabilities of HP-UX 11i v3, the system will be well protected against this, as described below.

3.3.1 Detect and Strobe Solution

The Detect and Strobe kernel functionality is described in the HP-UX 11i v3 Release Notes, Chapter 5 (http://docs.hp.com/en/5991-6469/index.html). This feature is designed to limit the amount of processor time spent in interrupt context to a maximum percentage over time. This provides better responsiveness for time-sensitive applications and high-priority kernel threads that could otherwise be delayed by interrupt activity. A tunable parameter, documented in the man page intr_strobe_ics_pct(5), is provided to control the operation of Detect and Strobe. It is enabled by default, and it is documented that only HP Field Engineers should change the value of this tunable.

3.3.2 HP-UX Networking Responsiveness Features

Several features of the networking kernel code contribute to protecting the system from packet storms, and these have been improved in HP-UX 11i v3. In general, synchronization points exist in the protocol layers to serialize the processing of packets when required. For example, to maintain the state of a particular TCP connection, inbound and outbound segments are processed serially as they are received by the upper or lower protocol layers, and queuing can occur. The queued backlog of packets could become a responsiveness issue, particularly when processed in an interrupt context.


However, by setting some reasonable limits on the queue length, and eliminating points of contention to allow more parallelism in TCP/IP processing, HP-UX has eliminated many causes of delay in the kernel, even when the system is under extreme load. In addition, the Detect and Strobe feature will be activated if the incoming traffic is more than the system can handle.

3.3.3 Responsiveness Tuning

The cost of providing responsiveness for the overall system in the case of packet storms is that incoming network interrupts can be delayed or even dropped. This will usually occur in a case where the incoming packets would eventually be dropped anyway due to a kernel queue overflow, memory shortage, or network protocol timeout. Given that dropping packets is inevitable, dropping them as soon as possible in the NIC uses fewer operating system resources, and is therefore the most desirable response. The latter is particularly true in the case of packet storms consisting of unwanted packets, for example from a malfunctioning switch, where the loss of the packets themselves is of little or no consequence. In the case of reliable protocols such as TCP, the dropped data should be recovered through a retransmission, and the protocol should help relieve the overflow by slowing down the connection using the TCP congestion window. In the case of an unreliable datagram protocol such as UDP, the loss of data may be noticeable at the user or application level. The logging messages described in intr_strobe_ics_pct(5) can be used to determine when the Detect and Strobe feature has been activated due to excessive interrupt activity. In addition, HP Support can retrieve network-specific kernel statistics that can determine if packets are being dropped due to queue overflows. If responsiveness is not critical on the system, it may be possible to gain a small amount of performance by tuning intr_strobe_ics_pct(5) to allow a higher maximum percentage of interrupt processing. Other approaches to increasing responsiveness and performance include using Interrupt Migration, as described in section 3.4, and binding critical applications to processors or processor sets where interrupt activity is less likely to be a problem. Many of the features described above are available in HP-UX 11i v1 and 11i v2 at the most recent patch levels. The HP-UX Reference for HP-UX 11i v2 (September 2004) describes intr_strobe_ics_pct(5), which is disabled by default in that version. In September 2006, an 11i v2 Detect and Strobe solution was released based on the May 2005 update release, with some additional recommended patches. A similar responsiveness solution was released for HP-UX 11i v1 as a set of patches and optional products. The Interrupt Migration product for 11i v1 is one of these optional products, and is available without cost from software.hp.com. Because of the set of patches required, and the recommendation that only the HP Field modify intr_strobe_ics_pct(5), HP Support should be contacted if a responsiveness solution is required, for all HP-UX 11i releases.

3.4 Interrupt Binding and Migration


In HP-UX 11i, the system administrator has the ability to assign interface cards to interrupt specific processors, overriding the default assignment performed by the operating system. The command used for this assignment, called Interrupt Migration, is intctl(1M). The default assignments done by the operating system at boot time will spread interrupts evenly across a set of processors, and will work well in most cases. However, for optimal network performance, it may be necessary to change this, taking into consideration the overall system and application workload.


3.4.1 Configuration Scenario for Interrupt Migration

A significant amount of network protocol processing for inbound packets is done as part of the interrupt from the network interface. In order to avoid a CPU bottleneck when there is heavy network traffic, Interrupt Migration can be used to move interrupts away from heavily-loaded processors. Examples of this load balancing could be to configure two busy network interfaces to interrupt separate processors, or to schedule network interrupts away from a processor which is busy with unrelated application processing. In the case of an IP subnet configured using Auto Port Aggregation (APA), maximum throughput can be achieved by assigning interrupts for each interface in the aggregate to a separate processor. The 10 Gigabit Ethernet driver (ixgbe) for HP-UX provides load balancing through the destination-port based multiqueue feature. This allows multiple processors to be interrupted by the 10 Gigabit card, and the incoming traffic can be separated into multiple flows based on the TCP destination port. Only TCP is supported by the destination port multiqueue feature. This increases the maximum throughput of the 10 Gigabit card, which would otherwise be limited by the interrupt processing speed of a single CPU. The "10GigEthr-00 (ixgbe) 10 Gigabit Ethernet Driver" release notes (http://docs.hp.com/en/J637990003/J6379-90003.pdf) explain the configuration of the multiqueue feature.

3.4.2 Cache Affinity Improvement

Network protocols are layered, and data and control structures are shared between these layers. When these structures are brought into a processor's cache, less time is spent stalling for cache misses as the remaining protocol layers process the packet. Since interrupts for a NIC are bound to a processor, there is a good possibility that some structures will still be in the correct processor's cache when the next packet for a given connection arrives. However, when an application receives the data, there is the possibility of additional cache misses, as the HP-UX scheduler assigns application threads to processors independently of the interrupt bindings. To get the most efficient operation from a cache standpoint, it is beneficial to have the interrupt assigned where the busiest applications are consuming the data. Using mpctl(2) on a per-application basis, and optionally defining processor sets, applications can be restricted to run on specific processors. If this does not result in a CPU bottleneck, it is most efficient both for the application and from a system-wide perspective. On the other hand, there is little cache sharing between network interfaces, so there will be little benefit from cache affinity if multiple network interfaces interrupt the same processor.


Improving HP-UX Server Performance

4.1 Tuning Application and Database Servers


Many of the enterprise applications today are architected and built using the J2EE framework, which is designed for the mainframe-scale computing typical of large enterprises. The J2EE framework provides a way to architect solutions which are distributed, multi-tiered and scalable. The diagram below shows an overview of the multi-tiered J2EE application architecture.

[Figure: Multi-tiered J2EE application architecture. Web clients connect over the Internet to the Enterprise Data Center, which contains Tier 1 (Web Server), Tier 2 (App Servers), and Tier 3 (DB Server).]

In such an architecture, the client tier typically consists of web browsers or traditional terminals used at points of sale, etc. In a typical deployment, the web and business tiers are either separate or may be hosted within a single physical server. Application servers normally run the business logic of an enterprise and communicate with backend database servers using application programming interfaces such as the Java Database Connectivity (JDBC) interfaces. Though both web servers and application servers can be hosted on a single physical server, the common practice is to separate them and run them on different physical servers for better performance and scalability of applications. In an actual deployment, there may also be components such as a network load balancer which help balance the load among multiple application servers and/or web servers. The diagram below shows a typical physical view of such a deployment.


[Figure: Physical view of a typical deployment. Traffic arrives over the Internet or leased lines (http/https) and passes through a firewall and load balancer; the Web Server, App Server, and DB Server reside in the Enterprise Data Center, segmented into Open Zone, DMZ, and MZ.]

4.1.1 Tuning Application Servers


The network traffic characteristics of a physical server used as an application server vary based on its usage context and the nature of the applications (business logic) that it runs. Web applications are normally implemented using technologies such as servlets and JSP scripts. For example, users first connect to the Web server, which in turn forwards the request to run an application. Such an application may be implemented as a servlet on an application server. Based on the application logic, the application server may need to access the back-end database server. Typically, application servers communicate with front-end web servers or back-end database servers using a shared set of TCP connections, an approach known as connection pooling. Web servers reuse these connections for forwarding the requests from different clients at different points in time. The connection pooling approach is preferred, for performance reasons, over creating new connections on demand. The number of TCP connections in the pool is often configurable and is based on the number of concurrent users that the system needs to support during peak load conditions. Application server vendors usually suggest a set of networking-related tunable parameters that are best suited to run the application server on a given OS platform. In this section we provide a set of guidelines on tuning network tunable parameters to run application servers on HP-UX 11i. Most of the tunable parameters discussed below are queried or set using the ndd command on HP-UX. Please refer to Appendix B for more details on these tunable parameters.

4.1.1.1 tcp_time_wait_interval

A physical server has to support a large number of concurrent TCP connections if it is used to run both an application server and a Web server simultaneously. tcp_time_wait_interval controls how long connections stay in the TIME_WAIT state before closing down.


Opening and closing a large number of TCP connections, as is the case with Web servers, may result in a large number of connections staying in the TIME_WAIT state before getting closed. Application server vendors may suggest tuning this parameter related to TCP's TIME_WAIT timer. With the default value of 60 seconds for tcp_time_wait_interval on HP-UX, the HP-UX stack can track literally millions of TIME_WAIT connections with no particular decrease in performance and only a slight cost in terms of memory. Please refer to Appendix B for further discussion on this tunable parameter.

4.1.1.2 tcp_conn_request_max

Depending upon the configuration of a physical server, application servers typically need to accept a large number of concurrent connections. The number of connections that can be queued by the networking stack is the minimum of the listen backlog and the tunable parameter tcp_conn_request_max. Application server vendors may suggest that this tunable parameter be set to 4096. On HP-UX the default value for this tunable is 4096, so it may not need a change. Use netstat -p tcp to monitor any dropped connections due to listen queue full conditions, and increase the value of this tunable parameter if necessary. Refer to section 4.2.4 for a detailed description of this tunable parameter.

4.1.1.3 tcp_xmit_hiwater_def

This parameter controls the amount of unsent data that triggers the write side flow control. For typical OLTP types of transactions (short request and short response), this parameter needs no tuning. Increasing this tunable enables large buffer writes. For Decision Support System (DSS) workloads (i.e. small query and large response), we recommend setting this tunable parameter to 65536 (default is 32768). Please refer to Appendix B for further discussion on this tunable parameter.
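For example (illustrative commands only, consistent with the tunables discussed above), you could check for dropped connection requests and, for a DSS-style workload, raise the default send buffer size as follows:

# netstat -p tcp
# ndd -set /dev/tcp tcp_conn_request_max 4096
# ndd -set /dev/tcp tcp_xmit_hiwater_def 65536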

4.1.1.4 tcp_ip_abort_interval

In certain deployment scenarios, backend database servers may be used in a highly available cluster configuration. tcp_ip_abort_interval is the maximum amount of time a sending TCP will wait before concluding that the receiver is not reachable. Application servers may use this mechanism to detect node or link failure and automatically switch the traffic over to a working database server in a cluster configuration. In a typical deployment, application servers may be communicating with database servers and Web servers which are physically close. To enable faster detection and use of failover features in such an environment, it may be desirable to set this tunable parameter to a lower value than the default of 10 minutes. However, it is not recommended to set this parameter lower than tcp_time_wait_interval. Please refer to Appendix B for further discussion on this tunable parameter.
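As a purely illustrative example, and assuming the tunable is expressed in milliseconds (its default of 600000 corresponds to 10 minutes), a two-minute abort interval could be set as:

# ndd -set /dev/tcp tcp_ip_abort_interval 120000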

4.1.1.5 tcp_keepalive_interval

When there is no activity on a connection, and if the application requests that the keepalive timer be enabled on the connection, TCP sends keepalive probes at tcp_keepalive_interval intervals to make sure that the remote host is still reachable and responding. Application servers may make use of this feature (SO_KEEPALIVE) to quickly fail over in cluster configurations when there is not much network traffic. As application servers typically keep a pool of long-standing TCP connections open with both Web servers and database servers, it is desirable to detect node or link failures and fail over earlier during periods of very low network traffic. The default value is 2 hours; however, some application server vendors suggest tuning this parameter to a much lower value (e.g., 900 seconds) than the default.
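As an illustration only (not an HP recommendation), and assuming the tunable is specified in milliseconds, as its default of 7200000 (2 hours) suggests, a 900-second keepalive interval would be set as:

# ndd -set /dev/tcp tcp_keepalive_interval 900000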


4.1.1.6 tcphashsz

tcphashsz controls the size of several hash tables maintained within the kernel. For better performance, it is preferable to have larger tables, at the expense of more memory, when there is a large number of concurrent connections on the system. On modern-day servers memory may not be a major constraint. When the Web server and application server are run on the same physical machine, the suggested value for this tunable parameter is 32768. If the Web server and application server are running on different machines, then the number of concurrent connections on the application server may not be very large. In this case the default value (number of CPUs * 1024) should suffice. This parameter is set using the following command:

# kctune tcphashsz=32768

Please note that the system has to be rebooted for the new value to take effect. Otherwise, the system will continue to use the current value of the tcphashsz parameter. Refer to section 4.2.3 for more discussion on tuning tcphashsz.

4.1.2 Tuning Database Servers


There are several different database systems deployed today on HP-UX. Typically, networking is less of a bottleneck on a database server than I/O. Nevertheless, the following tuning may help improve the overall efficiency from a networking perspective.

4.1.2.1 tcp_xmit_hiwater_def

This parameter controls the amount of unsent data that can be queued to the connection before subsequent attempts by the application to send data will cause the call to block, or to return EWOULDBLOCK/EAGAIN if the socket is marked non-blocking. For typical OLTP types of transactions (short requests and short responses) this parameter needs no tuning. However, you may consider increasing it to 65536 from the default value of 32768 for DSS (Decision Support System) or BI (Business Intelligence) workloads that require a large amount of data to be transferred from the database server. Furthermore, this may help with data backups from database servers to an external storage device over network-attached storage (NAS).

4.1.2.2 socket_udp_rcvbuf_default

Cluster-based database technologies are becoming popular. Typically, nodes of such database clusters communicate among themselves using UDP, and a large amount of data may be exchanged between server nodes in a database cluster connected through an interconnect. In this case you may want to consider increasing the tunable parameter socket_udp_rcvbuf_default, which defines the default receive buffer size for UDP sockets. If the command netstat -p udp shows socket overflows, it might be desirable to increase this tunable parameter. Note that increasing the size of the socket buffer only helps if the overload condition is short and the burst of traffic is smaller than the socket buffer. Increasing the socket buffer size will not help if the overload is sustained.
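For example (the value shown is illustrative only, and assumes the socket_* tunables are managed through /dev/sockets as elsewhere in this paper), you could check for overflows and raise the default UDP receive buffer as follows:

# netstat -p udp
# ndd -set /dev/sockets socket_udp_rcvbuf_default 262144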

4.1.2.3 socket_udp_sndbuf_default

As mentioned above, cluster-based database technologies may use UDP to communicate between nodes in the cluster. This tunable parameter sets the default send buffer size for UDP sockets. The default value for this tunable parameter is 65536, which is optimal for cluster-based database technologies used on HP-UX.


4.2 Tuning Web Servers


As the demand for faster and more scalable web service increases, it is desirable to improve web server performance and scalability by integrating web server functionality into the operating system. Web servers are characterized by many short-lived connections, which are opened and closed at a very fast rate. In a busy web server environment, there could be tens of thousands of TCP connections per second. The following features and configurations are recommended for optimizing Web servers:
- Network Server Accelerator HTTP
- Socket Caching for TCP connections
- Increase the tcphashsz value
- Increase the listen queue limit
- MSG_EOF for TCP applications

4.2.1 Network Server Accelerator HTTP


The Network Server Accelerator HTTP (NSA HTTP) is a product that provides an in-kernel cache of web pages in HP-UX 11i. This section describes the performance improvements achievable with NSA HTTP and the system tuning needed to achieve these performance improvements. The following list highlights the techniques NSA HTTP implements to achieve superior performance of Web servers in HP-UX 11i:
- Serving content from RAM (main memory) eliminates disk latency and I/O bandwidth constraints.
- In-kernel implementation decreases transitions between kernel and user mode.
- Tight integration with the TCP protocol stack. This allows efficient event notification and data transfer. In particular, a zero-copy send interface reduces data transfer by allowing responses to be sent directly from the RAM-based cache.
- Deferred-interrupt context processing removes the overhead associated with threads.
- Re-use of data structures and other resources reduces the lengths of critical code paths.

The HTTP-specific portion of NSA HTTP is implemented as a DLKM module. In addition, the nsahttp utility is provided to configure and administer NSA HTTP. For a detailed description of the utility, refer to the nsahttp(1) man page. The NSA HTTP product is supported on HP-UX 11i and is available from http://www.software.hp.com.

4.2.1.1 Usage Scenarios

There are a number of ways NSA HTTP can be used in a Web server environment. We briefly describe two scenarios to highlight the most typical usage.

4.2.1.1.1 Single System Web Server with NSA HTTP

The simplest scenario uses NSA HTTP and a conventional user-level Web-server process co-located on a single system. In this topology, NSA HTTP increases server capacity by increasing the efficiency for processing static requests. NSA HTTP provides a fast path in the kernel that bypasses normal processing of static requests at the user level. This fast path entails having NSA HTTP parse each HTTP request to determine whether it can be served from the kernel. Requests that cannot be served from the kernel are passed to the user-level server process. Adding the fast path in the kernel, therefore, introduces additional parsing and processing to the path for requests served at the user level. This overhead is not significant and is more than compensated for by the increased efficiency when serving static requests.


4.2.1.1.2 Multiple Web Servers with Partitioned Content

High-traffic Web sites typically feature multiple servers that are dedicated for specific purposes. A given set of servers, for example, may serve specific content such as images, advertisements, audio, or video. Dedicating servers to specific content types limits the total working set that must be delivered by any single server and allows the server's hardware configuration to be tailored to its content. One common approach to partitioning content is to separate static and dynamic content. Servicing static content requests typically requires more I/O bandwidth and memory than servicing dynamic content; servicing dynamic content requests typically requires greater CPU capacity. A second typical usage scenario, usually associated with very-high-traffic Web sites, is to deploy NSA HTTP on multiple web servers dedicated to serving static content, with a load balancer and/or web switch that routes user requests to the appropriate server. This approach is typically viable when the content of a site has already been manually partitioned among a set of specialized servers.

4.2.1.2 Tuning Recommendations

This section describes the NSA HTTP operating parameters that you can tune to improve performance.

4.2.1.2.1 Maximum NSA HTTP Cache Percentage (max cache percentage)

The maximum NSA HTTP URI (Uniform Resource Identifier, a term that encompasses URL) cache size is configured as a percentage of system memory. You can set the value for this parameter by editing /etc/rc.config.d/nsahttpconf or by using the nsahttp(1) command:

# nsahttp -C max_cache_percentage

By default, max_cache_percentage is 50 (50% of system memory). You should set the value for max_cache_percentage in conjunction with the system file cache settings (see filecache_max(5) for 11i v3, and dbc_min_pct(5) for 11i v1 and 11i v2). See sendfile() in section 5.1 of this document for additional information on file cache settings, as sendfile caching is done directly in the file cache, separately from caching done by NSA HTTP.

4.2.1.2.2 Cache Entry Timeout

NSA HTTP has a URI cache entry timeout value. If an entry is not accessed for a period longer than the timeout value, NSA may re-use (write over) the entry. For best performance, an optimal timeout value must be found. If it is too high, the cache may contain many stale entries. If it is too low, there may be excessive cache entry timeouts and increased cache misses. You can set the cache timeout by editing /etc/rc.config.d/nsahttpconf or by using the following nsahttp command:

# nsahttp -e cache_timeout

The cache_timeout value is set in seconds. For example, the command nsahttp -e 7200 sets the cache entry timeout to 7200 seconds (two hours).

4.2.1.2.3 Maximum URI Page Size

NSA HTTP allows you to limit the maximum size of each of the URI objects (web pages) stored in the cache. You can tune this value to optimize cache usage. You can set the maximum URI page size by editing /etc/rc.config.d/nsahttpconf or by using the following nsahttp command:

# nsahttp -m max_uri_page_size


The max_uri_page_size is specified in bytes. For example, the command nsahttp -m 2097152 causes NSA HTTP to cache only web pages of 2 MB or less.

4.2.1.3 Performance Data

A simulated web server environment was used to measure the performance of NSA HTTP. The workload was a mix of static content (70%) and dynamic content (30%). The measurements were taken using Web servers that implement copy avoidance when servicing static requests. The performance improvement was about 13-17%. On workloads with only static content, the performance improvement was approximately 60-70%. The performance improvements can be significantly greater when NSA HTTP is used with web servers that do not implement copy avoidance for servicing static requests.

4.2.2 Socket Caching for TCP Connections


There is a finite amount of operating system overhead in opening and closing a TCP connection (for example in the processing of the socket(), accept() and close() system calls) that exists regardless of any data transfer over the lifetime of the connection. For long-lived connections, the cost of opening and closing a TCP connection is not significant when amortized over the life of the connection. However, for a short-lived connection, the overhead of opening and closing a connection can have a significant performance impact. HP-UX 11i implements a socket caching feature for better performance of short-lived connections such as web connections. A considerable amount of kernel resources (such as TCP and IP level data structures and STREAMS data structures) is allocated for each new TCP endpoint. By avoiding the allocation of these resources each time an application opens a socket, or receives a socket with a new connection through the accept() system call, a server can proceed more quickly to the data transfer phase of the connection. The socket caching feature for TCP connections saves the endpoint resources instead of freeing them, speeding up the closing function. Once the cache is populated, new TCP connections can use cached resources, speeding up the opening of a connection. TCP endpoint resources cached due to the closing of one TCP connection can be reused to open a new TCP connection by any application. HP-UX 11i v3 has been enhanced to cache both IPv4 and IPv6 TCP connections. HP-UX 11i v1 and HP-UX 11i v2 support caching of IPv4 TCP connections only. HP-UX does not currently cache other transport protocols such as UDP.

4.2.2.1 Tuning Recommendation

Socket caching is enabled by default for IPv4 and IPv6 TCP connections. The default number of TCP endpoint resources that are cached is 512 per processor. The number of cached elements (TCP endpoint resources) can be changed by using the ndd tunable socket_caching_tcp. For example, to set the number of cached elements to 1024:

# ndd -set /dev/sockets socket_caching_tcp 1024


The socket_caching_tcp tunable controls both IPv4 and IPv6 TCP connections. Please refer to the ndd help text for more information. The ndd help text for socket_caching_tcp may be obtained by executing the following command:

# ndd -h socket_caching_tcp

The number of elements to be cached for optimal performance depends upon the frequency of open/close operations and on how the number of simultaneous connections changes over time. In most cases, the default value of 512 is acceptable, but if your application workload is such that large bursts of connections are opened and closed quickly, then increasing the number of elements to be cached will improve performance.

4.2.2.2 Performance Data

The publicly available netperf performance measurement tool can be used to measure the performance improvement of the socket caching feature. In the netperf TCP_CRR test, which simulates short-lived connections, the typical performance improvement is in the range of 15% - 20% when the socket caching feature is enabled with the default number of elements (512) to be cached.

4.2.3 Tuning tcphashsz


The tcphashsz configuration variable specifies the size of the networking hash tables. A system such as a Web server that constantly has a large number of TCP connections may benefit from increasing this value.

There are several dynamically allocated hash lists in TCP and IP. The sizes of these lists are controlled by the HP-UX configuration variable tcphashsz. Since this single variable controls the size of multiple lists, it is treated as a relative scale factor, rather than as the actual size of any specific list. The relative sizes of the individual lists are fixed. The absolute sizes are determined by the product of the scale factor and the relative size.

For Web servers handling a large number of TCP connections it is beneficial to increase this value in order to avoid long hash chains. In addition, increasing tcphashsz on systems with a high number of CPUs will improve scalability and avoid cache misses. The following simplified rule can be applied to tune tcphashsz:

tcphashsz = (number of CPUs * 1024)

For example, on a system with 16 CPUs this rule suggests a tcphashsz of 16384. The tcphashsz obtained from the above rule can be increased if the system workload is higher than expected. The larger the number of hash buckets, the better the efficiency. Having a larger number of hash buckets in a hash table minimizes the chance of multiple connections having their data structures placed in the same bucket. More memory is used when this value is raised because it increases the size of internal hash tables, but this should be an acceptable tradeoff since the memory increase is small compared to the benefit in improved performance.

The size of hash tables used in the kernel by network protocols is changed by modifying the system parameter tcphashsz using kctune(1M). Changes to the value of tcphashsz take effect only after a reboot of the system. The tunable tcphashsz must be a power of two. If it is not specified as a power of two, the system rounds it down to the nearest power of two. Prior to the 11iv3 0803 release, the minimum value is 256, the maximum value is 65536, and the default is 2048.

Starting with the 11iv3 0803 (PHNE_36281) release, the tunable tcphashsz is auto-tunable so that the system can decide the optimal value of tcphashsz at boot time. The default value of tcphashsz has been changed to 0. A value of 0 (default) will auto-tune tcphashsz in proportion to the number of cores in the system during system bootup. The minimum value is 0, and the maximum is 65536. If this tunable is set in the range of 1 to 255, then it will be increased to 256. It is recommended that the value of tcphashsz be set to 0 so that the system can choose an optimal tcphashsz value.

4.2.4 Tuning the listen queue limit


The listen queue limit can affect Web server performance. The following sections discuss the listen queue and the listen backlog.

4.2.4.1 The Listen Queue

An important concern for Web servers is to ensure that the server is able to handle the load placed on it by clients. The rate at which clients connect to servers can vary considerably. Under low loads the server will be able to service the incoming connection requests without any delay. However, as the incoming connection load increases, the Web server will not be able to handle the connections as quickly as they arrive. This is a very common scenario for any type of server. A buffer to store the incoming connection requests is provided so as to not drop any incoming connection request that cannot be processed immediately by the server. This "buffer" is normally referred to as the "listen queue".

When the server establishes the listening TCP endpoint, it specifies an upper limit for the number of established connections that will be allowed to wait in the listen queue. This limit is variously known by terms such as "listen backlog", "listen backlog limit", "backlog", or "listen queue limit". Client connection attempts in excess of this limit will be ignored. Clients will typically retry for a while, and then give up, often reporting the failed attempt as a "connection timeout". Note that client connection attempts which do not time out only get into the listen queue. They may still wait for a long time to receive service from the application.

4.2.4.2 The Listen Backlog and Web Server Performance

Choosing a good backlog value is a matter of balancing web server behavior and system performance against the expected client load and the consequences of clients being unable to connect. For commercial Web sites, such as web-based retail businesses or Web sites that rely on advertising revenue, clients unable to connect often represent lost business. For Web sites providing services, inability to connect can be seen as poor service. In such cases it is highly desirable to avoid having clients which are unable to connect.

The listen backlog is only one of a number of factors that affects the system's ability to service client requests. Generally, more significant factors will be things such as Web server configuration, the type of services being accessed by clients (e.g. static versus dynamic content, encrypted versus unencrypted connections, etc.), and simply the system's raw computing and I/O capacity. Tools such as glance(1), top(1) and sar(1M) on HP-UX 11i may be used to determine whether the system is operating at its physical capacity. If you determine that the system is not operating at capacity, and yet clients are being turned away, it is reasonable to consider whether you need to increase the listen queue limit.


On the other hand, if the system is found to be operating at full capacity, and yet clients are not being turned away, and if a high proportion of those clients experience long waiting times before service, then one may consider one of the following courses of action:

Limit the number of requests that are accepted into the listen queue, and cause the excess requests to be turned away.
Upgrade the system hardware to provide extra capacity.
Replace or modify the application to enable it to process requests more efficiently.

Usually it will be desirable in such cases to upgrade the system or application, but here we consider only the case where the administrator chooses to address the problem by limiting the load on the system.

4.2.4.3 Tuning the Listen Backlog

The limit on the listen backlog is primarily determined by the application program. Thus the application documentation (in this case, the Web server documentation) should be consulted for information on how the listen backlog can be configured for that program. However, HP-UX 11i provides the ability to control the maximum backlog value that applications are allowed to select for themselves. This parameter is the ndd tunable tcp_conn_request_max, which by default has a value of 4096.

If connection requests are being dropped and the application is using a backlog limit larger than 4096, then the system administrator can configure a larger listen queue by increasing the value of the ndd tunable tcp_conn_request_max. Or, if the system is overloaded and the application does not permit configuration of a lower backlog limit, then the ndd tunable tcp_conn_request_max may be set to a value lower than the one used by the application, and it will lower the backlog limit that the application will be able to use.

Note that while changes to the ndd tunable tcp_conn_request_max take effect in the system immediately, the value is only applied to listening sockets created after the value is changed. Therefore the Web server or other applications affected should be shut down and restarted after making changes to this parameter.

4.2.4.4 Monitoring Listen Backlog Usage

The system administrator may use the TCP statistics reported by netstat -s or netstat -p tcp to decide whether the system may benefit from changing the backlog.

...
447144 connection accepts
...
21118 connect requests dropped due to full queue

From this output one learns the number of connections that have been accepted by all "server" applications (including the Web server). Monitoring the growth of this count will give an indication of the rate at which client requests are being serviced. The other thing one learns from this is the number of connections dropped due to a full queue. This shows how many connection attempts were received at a time when the target listen queue was already full. The rate at which this number grows will indicate the number of client connection attempts which are failing. If this number is growing at a substantial rate, and if examination of the overall system workload shows that the system has additional available processing capacity, then the Web server can benefit from increasing the listen backlog -- first by seeing if the web server provides a way to configure this, and then by looking at the value of the ndd tunable tcp_conn_request_max.


Note however that the depth of the listen queue and the rate of dropped requests represent the balance between the rate at which the requests arrive and the rate at which the server is able to accept connections. For information on implementation of server programs to make effective use of the listen backlog, please refer to section 5.4.

4.2.5 Using MSG_EOF flag for TCP Applications


The MSG_EOF feature improves TCP application performance by piggybacking the FIN segment on the last data segment. The MSG_EOF flag allows an application to signal End-Of-File to the transport. When the MSG_EOF flag is used in a send() system call, it initiates a write-side shutdown along with the send operation. Use of the MSG_EOF flag is semantically equivalent to a send() followed immediately (if the send is successful) by a shutdown(s, SHUT_WR). If the data send operation fails, then the shutdown operation is not performed. Once this flag is used in a successful send() call, no further data may be sent on the socket.

For TCP client-server transactions, FIN segments are typically sent separately from the data. By using the MSG_EOF flag, the FIN is piggybacked on the last data segment, reducing the number of segments exchanged between peers. For example, in a typical TCP transaction, a client does connect(), send(), recv(), and close() while a server does accept(), recv(), send(), and close(). Such a transaction typically exchanges eight segments without MSG_EOF, and five segments with MSG_EOF. Not only does the use of the MSG_EOF flag reduce the number of packets on the network, it also reduces the processing time on both sides of the TCP connection. Here is an example of the MSG_EOF usage in the send() system call:

TCP client:
    s = socket(af, socktype, proto);
    connect(s, addr, addrlen);
    send(s, buf, buflen, MSG_EOF);     /* send a request with MSG_EOF */
    recv(s, rbuf, rbuflen, 0);         /* receive a response */
    close(s);

TCP server:
    s = socket(af, socktype, proto);
    bind(s, addr, addrlen);
    listen(s, 4096);
    ns = accept(s, addr, addrlen);
    recv(ns, rbuf, rbuflen, 0);        /* receive a request */
    send(ns, buf, buflen, MSG_EOF);    /* send a response with MSG_EOF */
    close(ns);

The MSG_EOF feature is available in the 11iv3 0803 (PHNE_36281) release. To enable the MSG_EOF feature, set the ndd tunable socket_msgeof to 1. The default of socket_msgeof is 0 (off).


4.3 Tuning Servers in Wireless Networks


Cellular mobile wireless networks are characterized by long latency, a large bandwidth-delay product, and volatile delays. Because cellular wireless networks have a long latency of a few hundred milliseconds to a few seconds, their bandwidth-delay product is large, especially for 3G wireless and beyond. Therefore, the TCP window size needs to be set sufficiently large in a wireless network environment. For efficient TCP communications over wireless networks, HP-UX provides the following features:

TCP Window Size and Window Scale Option
Large Initial Congestion Window
Limited Transmit
SACK
Smoothed RTO Algorithm
F-RTO

This section describes the Smoothed RTO Algorithm and F-RTO and their configuration. Refer to Chapter 2 for the other four TCP performance features, as they are covered under the Out of the Box performance features. These features are available in HP-UX 11i v3. Many of these new features were delivered in ARPA Transport patches for 11i v1 and v2. Refer to Table 2 (at the end of Appendix B) for more information.

4.3.1 Smoothed RTO Algorithm

The round trip times (RTT) of cellular mobile wireless networks are highly variable, and therefore they are prone to produce spurious retransmission timeouts. HP-UX has an enhanced TCP Retransmit Timeout (RTO) algorithm so that it adapts quickly even in the volatile environment of wireless networks. As a result, it provides a much better probability of avoiding costly spurious retransmissions without sacrificing responsiveness to real packet loss. In addition, the initial value and the lower limit for the RTO timeout are configurable.

The initial value for the RTO timeout is configured by the ndd tunable tcp_rexmit_interval_initial. When TCP first sends a segment to the remote system, it has no history about the round-trip times. So, it will use tcp_rexmit_interval_initial as a best guess for its first retransmit timeout setting. Setting this value too low will result in spurious retransmissions and could result in poor initial connection performance. If this value is set too high, TCP may not be able to send many retransmissions before it reaches the tcp_ip_abort_interval or tcp_ip_abort_cinterval, which could result in a spurious connection abort.

The lower limit for the RTO timeout is configured by the ndd tunable tcp_rexmit_interval_min. Setting this value too low could result in spurious TCP retransmissions, which decrease throughput by keeping the TCP congestion window artificially small.

4.3.2 Forward-Retransmission Timeout (F-RTO)

Cellular mobile wireless networks have volatile behavior in the round trip times (RTT). Due to handovers and interference such as tunnels, they often exhibit spikes that are several times larger than a typical RTT. Some spurious retransmission timeouts are inevitable. The F-RTO algorithm is an HP-UX enhancement that can help TCP handle spikes in RTTs. Spurious retransmission timeouts are costly because they incur unnecessary retransmissions and also keep the congestion window (cwnd) small. By detecting spurious timeouts with F-RTO, HP-UX effectively avoids additional unnecessary retransmissions and accelerates the recovery of the cwnd that is shrunk to one segment by the timeout.

F-RTO is disabled by default. To enable F-RTO, the following ndd tunable is provided: tcp_frto_enable. The valid values for tcp_frto_enable are:

0  The local system does not use F-RTO. This is the default value.
1  The local system uses the F-RTO algorithm for detecting and responding to spurious timeouts.
2  The local system uses the F-RTO algorithm for detecting spurious timeouts. The response algorithm is based on TCP Congestion Window Validation.

When a retransmission timeout (spurious or not) occurs, the cwnd is shrunk to one segment, and the ssthresh is also shrunk to one-half of the amount of unacknowledged data in the network, i.e. to one-half of the flight size. After F-RTO detects a spurious timeout, both option 1 and option 2 avoid unnecessary retransmissions. Additionally, they accelerate the recovery of the cwnd to its previous value.

Option 1, the original F-RTO response algorithm, restores the cwnd to one-half of the previous flight size, i.e. the flight size before the timeout. Therefore, the recovery is faster than the conventional RTO recovery that starts with the cwnd set to one. Option 2, an algorithm based on Congestion Window Validation (RFC 2861), restores both the ssthresh and the cwnd. The congestion window validation algorithm updates the ssthresh and the cwnd by:

ssthresh = max(ssthresh, 3*cwnd/4)
win      = min(cwnd, receiver's declared max window)
cwnd     = max(win/2, MSS)

Option 2 deploys the same approach: it restores the ssthresh to the maximum of the previous ssthresh or 3/4 of the previous flight size, and restores the cwnd to one-half of the previous flight size. This is more aggressive in recovery than option 1 because the ssthresh is also restored back close to the previous level, but it is still conservative because it does not restore the cwnd to the full flight size. Instead, it reduces the ssthresh and the cwnd as if the congestion window validation were exercised.
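To make the arithmetic above concrete, the following minimal C sketch applies the same three-step update to illustrative variables. This is not HP-UX kernel code; the function and variable names are invented for this example, and cwnd here stands for the flight size at the time of the timeout.

#include <stdio.h>

/* Hypothetical helper: apply the congestion window validation update
 * shown above. All names are illustrative; this is not kernel code. */
static void cwv_update(unsigned int *ssthresh, unsigned int *cwnd,
                       unsigned int recv_max_window, unsigned int mss)
{
    unsigned int win;

    /* ssthresh = max(ssthresh, 3*cwnd/4) */
    if (*ssthresh < (3 * *cwnd) / 4)
        *ssthresh = (3 * *cwnd) / 4;

    /* win = min(cwnd, receiver's declared max window) */
    win = (*cwnd < recv_max_window) ? *cwnd : recv_max_window;

    /* cwnd = max(win/2, MSS) */
    *cwnd = (win / 2 > mss) ? win / 2 : mss;
}

int main(void)
{
    /* Example values in bytes: flight size 64 KB, ssthresh 32 KB,
     * receiver window 128 KB, MSS 1460. */
    unsigned int ssthresh = 32768, cwnd = 65536;

    cwv_update(&ssthresh, &cwnd, 131072, 1460);
    printf("ssthresh=%u cwnd=%u\n", ssthresh, cwnd);  /* prints 49152 and 32768 */
    return 0;
}

With these example numbers, the ssthresh is restored to 3/4 of the flight size (49152 bytes) and the cwnd to one-half of the flight size (32768 bytes), matching the description of option 2 above.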


Tuning Applications Using Programmatic Interfaces

5.1 sendfile()
The sendfile() system call allows the contents of a file to be transmitted directly over a TCP connection, without the need to copy data to and from the calling application's buffers. This provides a zero-copy mechanism for sending data to the remote side. Refer to the sendfile(2) manpage for details on the syntax and usage of sendfile.

In HP-UX 11i v3, sendfile() has been updated to use the Unified File Cache (UFC), which provides kernel access to file data. There is no externally visible change related to this. However, kernel tunables on 11i v1 and v2 used for tuning the buffer cache and sendfile memory usage no longer have an effect on HP-UX 11i v3 systems. Obsolete buffer cache tunables include bufcache_max_pct, bufpages, dbc_min_pct, dbc_max_pct, and nbuf. The sendfile_max tunable is also obsolete. New tunables are documented in filecache_max(5), filecache_min(5), fcache_seqlimit_file(5), and fcache_seqlimit_system(5). Refer to the manpages in the HP-UX Reference manual and the HP-UX 11i v3 Release Notes, in Chapter 6 under HP-UX File Systems Architecture Enhancements (http://docs.hp.com/en/5991-6469/index.html), for more documentation of these tunables.

sendfile() uses the file cache with sequential access for transmitting file data. Therefore, the performance of sendfile() can be affected by how much of a given file can be cached, based on the size of the file cache (filecache_min/filecache_max) and the amount available for a file using sequential access (fcache_seqlimit_file). If the entire file can be kept in the file cache between sendfile calls, the second sendfile call will be able to operate at the speed of memory instead of being delayed by disk I/O. Conversely, in sizing the file cache, the performance of other applications should also be taken into account. If a large file monopolizes the file cache after a sendfile() call, other files may be flushed from the file cache, slowing non-sendfile file accesses.

The default tunable values should be optimal for most purposes. In the case where sendfile_max had previously been changed from the default to a lower value to limit the amount of buffer cache available to sendfile(), you may want to consider tuning fcache_seqlimit_file and/or fcache_seqlimit_system to a lower value. In the case where sendfile_max had previously been increased from the default value, and the size of the files transmitted with sendfile() is close to the size of the file cache or the size of physical memory, you may want to increase the value of filecache_max. If this can be done without causing a shortage of memory elsewhere in the system, it could improve the performance of sendfile() by allowing the entire file to reside in the cache.
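As an illustration only, the following sketch sends an entire file over an already-connected TCP socket using sendfile(). It assumes the HP-UX style prototype with a header/trailer iovec pair and a byte count of 0 meaning "send to end of file"; consult sendfile(2) for the exact types and semantics on your release. The function name and file setup are invented for the example.

#include <sys/socket.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Transmit an entire file over a connected TCP socket 's' without
 * copying it through user-space buffers. Returns 0 on success. */
int send_whole_file(int s, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* No header or trailer data, so pass NULL for the iovec pair.
     * offset 0 and nbytes 0 are assumed to mean "the whole file". */
    if (sendfile(s, fd, (off_t)0, 0, NULL, 0) < 0) {
        perror("sendfile");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}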

5.2 Polling Events


To monitor I/O conditions on multiple file descriptors, the select() and poll() system calls are typically used. Although poll() is preferred over select() because it allows various types of events to be monitored and is more efficient, using poll() can still affect performance when a large number of file descriptors are monitored.

On HP-UX there is another mechanism, called /dev/poll (event port), that can be used to monitor I/O events on a large number of file descriptors. In this mechanism, an application opens the /dev/poll driver and registers a set of file descriptors that it wants to monitor, along with the set of events it wants to monitor for those file descriptors. Then it can issue a DP_POLL ioctl() call on the event port driver to check which events have occurred on the registered file descriptors. The DP_POLL ioctl() on return specifies the file descriptors (if any) that have events pending and which events have occurred.

/dev/poll usually performs better than select() and poll() when the application has registered a very large number of file descriptors on which the specified events occur sparsely. However, in cases where the specified events are likely to occur simultaneously on a large number of registered file descriptors, the performance gain may be diminished. Also, /dev/poll is better suited for applications that do not open and close the polled file descriptors very often, as the cost of registering the polled file descriptors may become greater than the gain achieved from polling them. See poll(7) for more details on the event port mechanism and the /dev/poll interface.
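The following minimal sketch shows the general shape of the /dev/poll usage described above: file descriptors are registered by writing pollfd entries to the driver, and pending events are collected with the DP_POLL ioctl(). It assumes a Solaris-style <sys/devpoll.h> header with a struct dvpoll, as documented in poll(7); check that manpage for the exact header, structure, and ioctl names on your system.

#include <sys/devpoll.h>   /* assumed header for DP_POLL and struct dvpoll */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Wait up to 'timeout_ms' for POLLIN on the descriptors in fds[]. */
int wait_for_input(int *fds, int nfds, int timeout_ms)
{
    struct pollfd reg[64], ready[64];
    struct dvpoll dvp;
    int dp, i, n;

    if (nfds > 64)
        return -1;

    dp = open("/dev/poll", O_RDWR);
    if (dp < 0) {
        perror("open /dev/poll");
        return -1;
    }

    /* Register the descriptors by writing pollfd entries to the driver. */
    for (i = 0; i < nfds; i++) {
        reg[i].fd = fds[i];
        reg[i].events = POLLIN;
        reg[i].revents = 0;
    }
    if (write(dp, reg, nfds * sizeof(struct pollfd)) < 0) {
        perror("write /dev/poll");
        close(dp);
        return -1;
    }

    /* Collect pending events with the DP_POLL ioctl. */
    dvp.dp_fds = ready;
    dvp.dp_nfds = 64;
    dvp.dp_timeout = timeout_ms;
    n = ioctl(dp, DP_POLL, &dvp);
    for (i = 0; i < n; i++)
        printf("fd %d has events 0x%x\n", ready[i].fd, (unsigned)ready[i].revents);

    close(dp);
    return n;
}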

5.3 send() and recv() Socket Buffers


When writing applications that transmit or receive large quantities of data, it may be important to consider the values of the socket parameters often referred to as the "socket buffer sizes", namely the SO_SNDBUF and SO_RCVBUF socket options. These options can be set and examined using the setsockopt() and getsockopt() system calls. The precise effect of these parameters varies between implementations of the socket API, and between types of socket. The behavior for SOCK_DGRAM is fairly standard, and is not discussed here. For SOCK_STREAM, there are some aspects of the behavior which are not clearly delineated in the UNIX 03 or other interface specifications, and so the precise behavior varies between implementations. The following discussion describes the behavior which is specific to HP-UX 11i.

5.3.1 Data Buffering in Sockets

The basic idea of the SO_RCVBUF and SO_SNDBUF socket options is to allow the application to regulate the amount of data that can be buffered in the pipeline between the sending side and the receiving side. The relationship between the values of these parameters and the amount of data in the pipeline is somewhat imprecise, as discussed below. For a TCP (SOCK_STREAM) socket, there are essentially three places data can be buffered:

a) Data produced by an application and buffered before reaching the sending protocol code.
b) Data "in flight" between the sender and receiver (including packets on the sending side which have been queued by the protocol code but not yet sent, and packets on the receiving side which have been received but not yet processed by the protocol code).
c) Data buffered by the receiver after protocol processing, waiting to be consumed by an application.

Application programmers often make assumptions about the relationship between the values of the SO_SNDBUF or SO_RCVBUF parameters and the amount of data buffered at some or all of the places in the pipeline mentioned above.

5.3.2 Controlling Socket Buffer Limits

If an application sets the value of the SO_RCVBUF parameter before establishing a connection, TCP will use the value of SO_RCVBUF when negotiating the TCP receive window. If the value is large, then a window scaling option value may be negotiated. The receive window size of the peer, in combination with the SO_SNDBUF set by a sending application and TCP congestion control algorithms, limits the total amount of data that can be buffered in a single direction over a given connection.


Generally speaking, the value set by the application for the SO_SNDBUF size will be an approximate limit on the sum of the values of (a) and (b) above, and the SO_RCVBUF size will be an approximate limit on the sum of the values of (b) and (c) above. TCP socket applications should not be written to assume precise control over the amount of data in the send or receive pipeline, or any precisely synchronized relationship between send and receive socket buffer sizes and how much data can be read or written in a single call before blocking (for blocking sockets) or returning from the call (for nonblocking sockets). A recv() call made on a blocking or nonblocking TCP socket can return any number of bytes up to the size of the buffer passed in to the call, regardless of the SO_RCVBUF setting. Similarly, a send() on a nonblocking TCP socket may be able to send any number of bytes up to the size of the buffer passed in to the call, regardless of the SO_SNDBUF setting.

5.3.3 System Socket Buffer Tunables

The system administrator can specify default values for these parameters to use for applications which do not set a value for these options. In addition, limits can be placed on the values allowed for applications which do set a value. These default and maximum values are specified using the ndd tunables listed below:

Transmit parameters:
tcp_xmit_hiwater_def: default SO_SNDBUF value
tcp_xmit_hiwater_max: maximum value applications may set

Receive parameters:
tcp_recv_hiwater_def: default SO_RCVBUF value
tcp_recv_hiwater_max: maximum value applications may set

In addition to these "high water mark" parameters, the transmit side also has a related "low water mark" ndd tunable:

tcp_xmit_lowater_def: the amount of unsent data that relieves write-side flow control

The default values are used for cases where applications do not explicitly set a value via setsockopt() for sockets or t_optmgmt() for XTI. Please consult Appendix B for a detailed discussion of the precise meaning of each of these parameters.
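For illustration, the sketch below shows an application overriding the system defaults by setting SO_SNDBUF and SO_RCVBUF with setsockopt() before connecting; as noted above, SO_RCVBUF in particular should be set before the connection is established so that it can influence window negotiation. The 256 KB sizes and the function name are arbitrary example values, not a recommendation.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Create a TCP socket with explicit socket buffer sizes and connect it. */
int connect_with_buffers(const char *ip, unsigned short port)
{
    int s, sndbuf = 262144, rcvbuf = 262144;   /* example sizes in bytes */
    struct sockaddr_in sin;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        return -1;
    }

    /* Set the buffer sizes before connect() so SO_RCVBUF can affect
     * the advertised receive window (and window scaling) negotiation. */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0 ||
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        perror("setsockopt");

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    sin.sin_addr.s_addr = inet_addr(ip);

    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("connect");
        close(s);
        return -1;
    }
    return s;
}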

5.4 Effective use of the listen backlog value


In section 4.2.4 there was a discussion of management of the listen backlog from a system administrator's point of view, when dealing with web servers. That discussion is for the most part equally applicable to any "network server" program that uses a listening TCP endpoint via either the socket or XTI programming interface.

Many programming examples for network server programs use hard-coded values for the listen backlog, and often use very small values such as 5, 10, or 20. Such values are good for textbook examples but often not for scalable "real world" servers. When developing a network server application it is important to realize that there is often no single listen backlog value that works best in all situations, particularly when the program will be used over a range of workload conditions, and over a range of systems with various performance characteristics.

Therefore it is highly beneficial to design such programs so that the choice of listen backlog value can either be configured directly, or adjusted indirectly as a result of some parameters which are based on the workload to be handled by the application. These parameters could be specified as command line options, in a configuration file, or by means of some kind of interactive administrator interface. The simplest choice is to implement a parameter which is used directly as the value that specifies the listen backlog used by the program when setting up the listening TCP endpoint. This is specified in the socket programming interface as the second parameter to the listen() system call. A less direct alternative would be to give the system administrator a parameter which specifies the fraction of "system resources" to be used by the application; such a parameter could be used to calculate the values the program will use for various parameters such as the listen backlog.
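A minimal sketch of the direct approach follows: the backlog is taken from a command line argument (a hypothetical -b option is assumed here) and passed straight to listen(). Error handling and the accept loop body are intentionally abbreviated; the port number is an arbitrary example.

#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int s, backlog = 4096;          /* default if no option is given */
    struct sockaddr_in sin;

    /* Hypothetical option: -b <backlog> lets the administrator size
     * the listen queue for the expected workload. */
    if (argc == 3 && strcmp(argv[1], "-b") == 0)
        backlog = atoi(argv[2]);

    s = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(8080);     /* example port */
    sin.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(s, backlog) < 0) {   /* the effective backlog is capped by tcp_conn_request_max */
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        int c = accept(s, NULL, NULL);
        if (c >= 0)
            close(c);               /* a real server would hand off the connection here */
    }
}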


Monitoring Network Performance

There are many tools available on HP-UX for monitoring network statistics. Of these tools, netstat(1) and lanadmin(1M) are very useful to monitor network statistics, especially to identify the following conditions which may potentially have an impact on network performance:

Retransmissions
Out-of-order packets
Window probes
Duplicate Acks
Ack-only packets
Bad Checksums
Dropped Fragments
UDP socket buffer overflows
Inbound packet discards/errors
Outbound packet discards/errors
Packet Collisions

6.1 Monitoring network statistics


The netstat command displays statistics for network interfaces and protocols. The output format varies according to the options selected. The netstat command without any options displays all active internet connections and all active UNIX domain sockets. This is useful for tracking what connections are being used on the system at any point in time. The -a option shows the state of all sockets, active or not; this option is useful because it also shows all the listening sockets. The netstat -i command shows statistics for all interfaces. It shows the MTU size for each network interface, and the number of inbound and outbound packets processed. The netstat -s command provides statistics for all the protocols, such as TCP, UDP, IPv4, ICMP, IGMP, IPv6, and ICMPv6.

To diagnose networking problems, it is best to use data collected over an interval and not all the statistics since boot. There is an unofficial tool called beforeafter which can be used to subtract one set of netstat statistics from another. This tool can be retrieved via the URL: ftp://ftp.cup.hp.com/dist/networking/tools/beforeafter.c

Typically, one might use beforeafter as follows:

# netstat -s > before
# do something or sleep for a time
# netstat -s > after
# beforeafter before after > delta
# more delta


6.1.1 Monitoring TCP connections with netstat -an

Each socket results in a network connection. Use the netstat -an command to determine the state of your existing network connections. The following example shows the contents of the protocol control block table and the number of TCP connections currently in each state:

# netstat -an | grep tcp | awk '//{print $6}' | sort | uniq -c
      1 CLOSE_WAIT
    728 ESTABLISHED
     23 FIN_WAIT_1
     12 FIN_WAIT_2
     51 LISTEN
     11 SYN_RCVD
   9163 TIME_WAIT

For Web servers, where the server initiates the closing of a connection, it is normal for the majority of connections to be in the TIME_WAIT state. Each TCP connection not in TIME_WAIT state requires approximately 12K bytes of memory in HP-UX, including memory for sockets, STREAMS, and protocol data structures. Connections in TIME_WAIT state require only a minimal amount of state to be maintained. Note that in this example, there are almost 10,000 TCP connections being used, fewer than 1000 of which are not in TIME_WAIT state, requiring less than 12 MB of memory.

6.1.2 Monitoring protocol statistics with netstat -p

netstat -p protocol is a subset of the netstat -s output. Use the netstat -p tcp command to check for retransmissions, connect requests dropped, out-of-order packets, and bad checksums for the TCP protocol. Use the netstat -p udp command to look for bad checksums and socket overflows for the UDP protocol. You can use the output of these commands to identify network performance problems by comparing the values in some fields to the total number of packets sent or received.

A large value in the connect requests dropped due to full queue field may indicate that the listen queue is too small or that clients have canceled requests. Section 4.2.4 describes how to increase the size of the listen queue. Other important fields to examine include the completely duplicate packets, out of order packets, and discarded fields. For example:

# netstat -p tcp
tcp:
        122850444 packets sent
                55697656 data packets (2041533911 bytes)
                56238 data packets (46227636 bytes) retransmitted
                68319667 ack-only packets (4200 delayed)
                0 URG only packets
                0 window probe packets
                1166089 window update packets
                639 control packets
        283493903 packets received
                4986072 acks (for 2039278403 bytes)
                1172 duplicate acks
                0 acks for unsent data
                278496894 packets (1635765768 bytes) received in-sequence
                0 completely duplicate packets (0 bytes)
                0 packets with some dup. data (0 bytes duped)
                8705 out of order packets (12431596 bytes)
                0 packets (0 bytes) of data after window
                787 window probes
                152 window update packets
                0 packets received after close
                0 segments discarded for bad checksum
                0 bad TCP segments dropped due to state change
        41 connection requests
        357 connection accepts
        398 connections established (including accepts)
        241 connections closed (including 0 drops)


        0 embryonic connections dropped
        4985775 segments updated rtt (of 4985775 attempts)
        0 retransmit timeouts
                0 connections dropped by rexmit timeout
        0 persist timeouts
        2 keepalive timeouts
                0 keepalive probe sent
                0 connections dropped by keepalive
        0 connect requests dropped due to full queue
        0 connect requests dropped due to no listener
        0 suspect connect requests dropped due to aging
        0 suspect connect requests dropped due to rate

Retransmitted segments are an indication of loss, delay, or reordering of segments on the network, and will have a strong negative effect on TCP throughput.

Important fields for the netstat -p udp command include the bad checksums and socket overflows fields, which should have low values. For example:

# netstat -p udp
udp:
        0 incomplete headers
        11 bad checksums
        3318 socket overflows

The previous example shows a value of 3318 in the socket overflows field, which indicates that the UDP socket buffer may be too small. You can increase the default UDP socket buffer size using the following ndd command:

# ndd -set /dev/sockets socket_udp_rcvbuf_default 65535

If you have UDP applications that need larger socket buffer sizes, HP recommends that you set the socket buffer sizes using setsockopt() instead of setting the system-wide default to a larger value. Regardless, it should be noted that simply increasing the size of the socket buffer only helps if the length of the overload condition is short and the burst of traffic is less than the size of the socket buffer. Increasing the socket buffer size will not help if the overload is sustained, or if the overload is caused by a slow or unresponsive application that is not receiving data from the connection at a rate at least equal to the rate from the network.

The two most common ways to identify the overflowing UDP socket are:

ndd -get /dev/ip ip_udp_status lists the UDP fanout table, and one of the columns is overflows. If the UDP port is long-lived, this will tell which UDP socket is experiencing the overflows.
Collect a network trace during one of the overflows and look for an ICMP SOURCE QUENCH packet. If the SOURCE QUENCH was generated because of a UDP socket overflow, the SOURCE QUENCH packet will tell you which UDP socket is overflowing.


Important fields for the netstat -p ip command include the statistics for dropped fragments, which should have low values. For example:

# netstat -p ip
ip:
        12278434 total packets received
        0 bad IP headers
        257 fragments received
        0 fragments dropped (dup or out of space)
        62 fragments dropped after timeout
        0 packets forwarded
        0 packets not forwardable

The fragments dropped due to out of space can arise from packet loss rates and settings on the limits of memory consumed by IP fragment reassembly (refer to the ndd tunable ip_reass_mem_limit in Appendix B). The only time there should be fragments dropped after timeout is when there is packet loss in the network.

# netstat -p ipv6
ipv6:
        6240733 total packets received
        5 bad IPv6 headers
        3206 fragments received
        3 fragments dropped

The bad IPv6 headers are due to malformed IPv6 headers or IPv6 extension headers. The fragments dropped could be due to the loss rate or the configured maximum amount of memory for IPv6 fragment reassembly (refer to the ndd tunable ip6_reass_mem_limit in Appendix B). It can also be related to the timeout while waiting for missing fragments (refer to the ndd tunable ip6_fragment_timeout in Appendix B).

For an annotated output of netstat -p protocol, refer to Appendix A.

6.1.3 Monitoring link level statistics with lanadmin

Use the output of the lanadmin -g command to check for input errors, output errors, and collisions. Compare the values in these fields with the total number of packets sent. High values may indicate a network problem. For example, cables may not be connected properly or the Ethernet may be saturated.

# lanadmin -g 0

                      LAN INTERFACE STATUS DISPLAY
                     Wed, May  2, 2007  13:30:17

PPA Number                      = 0
Description                     = lan0 HP PCI Core I/O 1000Base-T Release B.11.31.01
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 100000000
Station Address                 = 0x306e4c1c52
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 100
Inbound Octets                  = 310558225
Inbound Unicast Packets         = 125478
Inbound Non-Unicast Packets     = 26469103
Inbound Discards                = 0
Inbound Errors                  = 0
Inbound Unknown Protocols       = 881
Outbound Octets                 = 10832833
Outbound Unicast Packets        = 132612
Outbound Non-Unicast Packets    = 140
Outbound Discards               = 0
Outbound Errors                 = 0
Outbound Queue Length           = 2
Specific                        = 655367

Ethernet-like Statistics Group

Index                           = 1
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 326
Multiple Collision Frames       = 742
Deferred Transmissions          = 0
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 0
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0

The following fields should have small values if the system is running with optimal performance:

Inbound Discards
Inbound Errors
Outbound Discards
Outbound Errors

If there are excessive Inbound Discards, the system could be extremely busy or the system is under a packet storm. See Section 3.3 Protection from Packet Storms for tuning recommendations. Refer to lanadmin(1M) for additional options to display link level and DLPI statistics. They are useful in identifying whether packets are being discarded at the link level or at the DLPI level.

6.2 Monitoring System Resource Utilization


Monitoring the utilization of system resources such as CPU, memory, and caches is essential to identify performance bottlenecks. These bottlenecks must be well understood in order to improve network performance.

6.2.1 Monitoring CPU Utilization Using Glance

On HP-UX, glance(1) is a performance tool for displaying the breakdown of CPU utilization. Glance has the following screens of metrics, which may be selected by pressing the labeled softkey or by typing the following command:

a (CPU BY PROCESSOR)
c (CPU Report)


If any CPU is saturated, it could potentially become a bottleneck for network performance, especially if the saturated CPU is handling the interrupts from the network interface. Additional profiling can be done using Caliper to identify hot spots for CPU utilization.

6.2.1.1 Key CPU Performance Indicators in glance output

The key glance CPU performance indicators are the following:

User    - % of time CPU spent by processes in user mode
Sys     - % of time CPU spent for processing system calls
Intrpt  - % of time CPU spent for processing interrupts
CSwitch - % of time CPU spent for context switching

Abnormally high values for these indicators can point to a performance bottleneck. However, it is not straightforward to tell what values are abnormal, as there are numerous factors in determining these values: whether the applications are compute-intensive or I/O intensive, the speed of the CPU, whether the application is interactive or batch processing, whether the application is a server or client, etc. Heuristics can sometimes help identify abnormal values. For example, in a typical OLTP environment a ratio of 70% user and 10% system usage is ideal. Compute-bound environments may see ratios of 90% user to 10% system usage or more. Allowing a small percentage of idle CPU helps to provide for peaks or for expected growth in usage. High values for system usage might indicate problems like memory thrashing or SMP contention. On the other hand, high values for system usage do not necessarily indicate a performance problem. They might only indicate that the application makes a lot of system calls.

6.2.1.2 Key factors for CPU saturation

The key factors that cause CPU saturation are the following:

Expensive system calls, for example fork(2), select(2), poll(2), etc.
SMP contention involving spinlocks.
Cache misses.
Processing interrupts due to heavy network traffic.
Context switching caused by process scheduling.
Processing page faults, TLB misses, traps.

6.2.2 Monitoring CPU statistics using Caliper

Caliper is an extensive run-time performance analyzer that can be run on HP Integrity Servers with Itanium 2 processors for both the kernel and applications. On HP Integrity Servers, caliper can be used to efficiently identify CPU bottlenecks if CPUs are saturated by collecting CPU execution statistics. For example:

# caliper cpu -o cpu.txt my_program

This command runs my_program, measuring and reporting the overview metrics by taking one sample every 8 milliseconds. By default, 125 low-level samples will be aggregated into one user-reported sample, resulting in one aggregated sample collected per second. The text report is saved in the text file cpu.txt. For the CPU measurement report, you may specify one or more comma-separated lists of predefined event sets. Here are some key CPU events that can be monitored using caliper:

cpi       Provides metrics related to cycles per instruction (CPI).
stall     Provides metrics on primary CPU performance limiters by breaking the CPI into seven components.


l1dcache  Provides miss rate information for the L1 data cache.
l2cache   Provides miss rate information for the L2 cache.
tlb       Provides metrics related to translation lookaside buffer (TLB) misses.

6.2.2.1 Using caliper to Identify Bottlenecks

The following caliper profilers can be used to identify bottlenecks depending on the type of performance issue:

fprof    Find the most CPU-intensive code
cstack   Find where the program is waiting for system calls, locks, or I/O
traps    Locate code causing traps, faults, and interrupts
dcache   Find data access performance problems

Each profiler identifies the "hot" process, load module, function, source code statement, and machine instruction. For additional details, refer to the Caliper documentation which is available at http://www.hp.com/go/caliper. This web site provides documentation on how to use these profilers to identify performance problems in great detail.

6.2.3 Monitoring Memory Utilization using Glance

Glance can be used to view system memory utilization. The following example shows the breakdown of the memory utilization on a test system:

# glance
B3692A GlancePlus C.04.55.00     17:04:52    hpipxpr 9000/800    Current Avg High
--------------------------------------------------------------------------------
CPU  Util    SS                                              |      4%    1%   5%
Disk Util    F                                               |      1%    1%   8%
Mem  Util    S SU  U                                          |    28%   28%  28%
Network Util U  UR                                            |     8%    8%   8%
--------------------------------------------------------------------------------
MEMORY REPORT                                                        Users=    3
Event           Current   Cumulative   Current Rate   Cum Rate   High Rate
--------------------------------------------------------------------------------
Page Faults           0        16186            0.0       25.7       926.4
Page In               0         7254            0.0       11.5       477.1
Page Out              0            0            0.0        0.0         0.0
KB Paged In         0kb          0kb            0.0        0.0         0.0
KB Paged Out        0kb          0kb            0.0        0.0         0.0
Reactivations         0            0            0.0        0.0         0.0
Deactivations         0            0            0.0        0.0         0.0
KB Deactivated      0kb          0kb            0.0        0.0         0.0
VM Reads              0            0            0.0        0.0         0.0
VM Writes             0            0            0.0        0.0         0.0

Total VM  :  380mb   Sys Mem  : 691mb   User Mem:  288mb   Phys Mem :  4.0gb
Active VM :  229mb   Buf Cache:   3mb   Free Mem:  2.9gb   FileCache:  151mb
--------------------------------------------------------------------------------

The key glance performance indicators for memory utilization are Page Faults, Page In, Page Out, and Free Mem. It is important to note that excessive paging can have a very adverse effect on overall performance. Performance can be severely hit by paging because processors are much faster than disks; having to wait for memory to be paged in and out to disks can be very time consuming. Another aspect of paging is that it can cause external fragmentation of memory. This fragmentation leaves fewer large pages available for use by applications.

6.2.4 Monitoring Memory utilization using vmstat

The vmstat(1) command reports certain statistics kept about process, virtual memory, trap, and CPU activity. For example:

# vmstat -S
         procs           memory                 page                        faults         cpu
    r    b    w      avm     free   si  so  pi  po  fr  de  sr     in     sy    cs   us  sy
    1    0    0    71646   758881    0   0   0   0   0   0   0   2467  21014   360    1   3

The key vmstat performance indicators are the following:

pi - page in
po - page out

6.2.5 Monitoring Cache Miss Latency

The processor caches used for data and instructions can greatly affect the performance of the system. Caches are used specifically to store information that is read in with the expectation that it will be needed again soon by the CPU. In a multitasking environment, the cache contains entries belonging to many different processes, in the expectation that the processes will execute again before needed locations are flushed by other processes.

Cache problems become visible as overuse of user or system CPU resources. You can see this in CPU utilization metrics, or you can compare performance metrics on systems with different cache sizes or organizations. Unfortunately, it is very difficult to detect cache problems directly. One system with millions of cache misses per second may actually be performing well, while another system with several hundred thousand per second may not perform well. What really matters is how well the processor is able to overlap the cache miss time with useful work.

Caliper can be used to monitor cache misses. In addition, other cache metrics such as latency per miss can be tracked. It is a very useful tool for analyzing cache behavior on HP Integrity Servers. The "dcache" option in caliper can be used to find data access performance problems.

6.2.6 Monitoring other resources

Monitoring the performance of other resources such as switches and I/O backplane bandwidth, which can potentially become bottlenecks for network performance, is also essential for improving network performance.

6.3 Measuring Network Throughput


The netperf utility is a very useful tool to measure the throughput of the network and various other aspects of network performance. It can generate traffic patterns such as bulk data transfer and request/response data transfer, and it can measure performance for TCP, UDP, and the datalink layer. The complete reference manual is available at http://www.netperf.org.


6.3.1 Measuring Throughput with Netperf Bulk Data Transfer

The netperf bulk data transfer feature is useful for verifying the throughput and measuring the available bandwidth between two hosts.

On remote host B, start the netserver:

# netserver
Starting netserver at port 12865

Run netperf from host A:

# netperf -H hpipxprlan2 -l 60 -t TCP_STREAM -- -m 32768
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to hpipxprlan2 (192.168.138.160) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

32768  32768   32768    60.01       94.48

The default socket buffer size at the receiving side is 32768 bytes. The default socket buffer size at the sending side is 32768 bytes. The message size is 32768 bytes, specified using the -m option. The elapsed time is 60.01 seconds, specified using the -l option. The throughput obtained is 94.48 Mbits/s.

6.3.2 Measuring Transaction Rate with Netperf Request/Response

Request/response (RR) performance is often overlooked, yet it is just as important as bulk-transfer performance. Netperf request/response traffic is useful for simulating bursty types of network traffic such as OLTP and short web transactions. While a bulk-transfer test reports its results in units of bits or bytes transferred per second, an RR test reports transactions per second, where a transaction is defined as the completed exchange of a request and a response. One can invert the transaction rate to arrive at the average round-trip latency. Here is a sample of netperf request/response output collected on host A:

# netperf -H hpipxprlan2 -l 60 -t TCP_RR -- -r 128,1024
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to hpipxprlan2 (192.168.138.160) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

32768  32768  128      1024    60.01    4998.65

The default socket buffer size at the receiving side is 32768 bytes. The default socket buffer size at the sending side is 32768 bytes. The request size is 128 bytes and the response size is 1024 bytes, as specified by the -r option. The elapsed time is 60.01 seconds, specified using the -l option. The transaction rate obtained is 4998.65 transactions per second.


6.3.3 Key issues for throughput with Netperf traffic

Common issues for netperf throughput are the following:

Throughput is limited by the speed of the slowest link in the path from host A to host B.
The transaction rate is limited by the network latency between host A and host B.
High CPU utilization can also be the limiting factor for bulk transfer throughput, or can result in additional latency for request/response transactions.

6.4 Additional Monitoring Tools


Additional HP-UX tools are available for monitoring network performance:

intctl(1M)      display and modify the interrupt configuration of the system
ioscan(1M)      display hardware configuration paths including processors, disks, and Ethernet adapters
lanscan(1M)     display LAN device configuration and status
nettl(1M)       network tracing and logging
nwmgr(1M)       network interface manager command (11i v3)
ping(1M)        shows round trip latency and the hops to get from point A to B
sar(1M)         system activity reporter which is useful for monitoring CPU, memory, and disk I/O
tcpdump(1)      display packet headers on a network interface
top(1)          display and update information about the top processes on the system
traceroute(1)   display the route packets take to a network host

Note: With the introduction of nwmgr(1M), the lanadmin(1M), lanscan(1M), and linkloop(1M) commands are deprecated. These commands will be removed in a future HP-UX release.


Appendix A: Annotated output of netstat -s (TCP, UDP, IP, ICMP)


This appendix contains the annotated output of the netstat -s command run on an HP-UX 11i system. The following annotated output is from a test system, for illustration only; the output shown does not necessarily indicate any particular problem or issue.

TCP:
The TCP statistics can be retrieved by themselves with the command netstat -p tcp.

tcp:
612755429 packets sent

This is the total number of packets sent by TCP since boot. A "TCP packet" is referred to as a "segment." This value includes both data and control segments. A control segment could be a standalone ACK or window update, or a SYNchronize, FINished, or ReSeT segment.

414543546 data packets (3533061789 bytes)

This is the number of TCP segments sent which carried data, and the total quantity of data they carried. If one divides the octets (bytes) by the number of segments (packets), one has some idea of the average quantity of data per segment. In general, TCP will move data more efficiently the greater the amount of data per segment is, with an upper limit of the Maximum Segment Size (MSS). This is not always possible if applications are naturally making small sends, or if the TCP_NODELAY option is set unnecessarily, preventing TCP from aggregating multiple small sends from the application into a single large segment. For some applications (eg Telnet, ssh) which carry user-interactive traffic (eg keystrokes) the average segment size will necessarily be quite small. The average segment size could also be small because remote systems are advertising small values for the MSS (related ndd: tcp_mss_def, tcp_mss_min) or because this system is configured to not use Path MTU Discovery (related ndd: ip_pmtu_strategy, tcp_ignore_path_mtu).

6579 data packets (5914789 bytes) retransmitted

This is the number of TCP data segments which were retransmitted. These retransmissions could have come from a TCP retransmission timer expiring (ndd: tcp_rexmit_interval_min, tcp_rexmit_interval_max, tcp_rexmit_interval_initial), or from a fast retransmission, which is triggered by the arrival of multiple duplicate ACKs. In either case, retransmitted segments are an indication of loss, delay, or reordering of segments on the network, and will have a strong negative effect on TCP throughput. As with initial data segment transmission, one can arrive at an average retransmitted segment size by dividing the octet count by the segment count.

As far as a range of acceptable values for the number of retransmitted segments, it depends entirely on the conditions. A retransmission rate of as little as 1% can have a considerable effect on overall throughput. It can also have a noticeable effect on perceived responsiveness. Consider, for example, the case where one is simply typing at an ssh session. Consider also that the average round-trip time is something like 10 milliseconds. A 1% retransmission rate implies that one packet out of 100 is being retransmitted. An ssh session is not necessarily going to have "fast" retransmits, and the default value for tcp_rexmit_interval_min is 500 milliseconds. This means that the weighted average for keystroke round-trip time (i.e. echo) is going to be:

0.99 * 10 milliseconds + 0.01 * 500 milliseconds

which works out to 14.9 milliseconds - that 1% retransmission rate is adding nearly 50% to the average response time. It also means that for one character out of 100, or as often as once every few lines typed, there will be a half-second "hiccup" in the flow of characters. Whether or not this is noticed by the person typing depends on just how good a typist they happen to be.

198929341 ack-only packets (19851063 delayed)

This is the number of segments transmitted that were "standalone" ACKs - ie ACKnowledgement segments that were not also carrying window updates or piggy-backed on data segments. As such, they were pure overhead - in broad hand-waving terms, especially with NIC functionality like Checksum Offload (CKO) and copy-avoidance (eg sendfile()), sending (or receiving) an ACK is just as expensive in terms of CPU consumption as sending a data segment.

The numbers of "delayed" ACKs are those that were sent after the standalone ACK timer expired (related ndd: tcp_deferred_ack_interval). There are some situations that call for an immediate standalone ACK - for example, when a SYNchronize segment is received to initiate a new connection, a SYN|ACK is sent immediately. Another situation that requires an immediate ACK is the receipt of out-of-order data. When out-of-order data is received, TCP is required to send an immediate ACK specifying which data it was expecting to receive next. For example, if the remote system sent us TCP segments 1, 2, 3, 4 and 5, but segment two was either lost or delayed, we would send out an immediate "ACK 2" upon receipt of segments 3, 4 and 5.

If the standalone ACK timer expired to cause the transmission of the ACK, it implies that none of the other piggy-back or immediate ACK transmission rules were in effect. This could still be harmless - for example, in a request/reply application with long times between transactions or long transaction processing times (ie longer than tcp_deferred_ack_interval), standalone ACKs will be sent - either to ACKnowledge the receipt of the response (long time between transactions) or of the request (long transaction processing time) or both.

Another situation that leads to standalone ACKs are those applications which provide logically associated data to the transport in separate socket calls. This is sub-optimal because if those different pieces are smaller than the MSS (Maximum Segment Size), the application will run afoul of the Nagle algorithm (related to the TCP_NODELAY socket option), which is in TCP to help make sure that the average segment size is reasonably large. A sketch of one way to avoid this follows the next annotation.

10 URG only packets

This is the number of segments which contained URGent data.
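As referenced above, a common way to keep logically associated pieces of data from becoming separate small segments is to hand them to the transport in a single call, for example with writev(). The following sketch is illustrative only; the header/body split and the function name are invented for the example.

#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/* Send an application-level header and body as one gathered write so
 * TCP can coalesce them into as few segments as possible, instead of
 * issuing two small send() calls that may interact with the Nagle
 * algorithm. 's' is a connected TCP socket. */
ssize_t send_request(int s, const char *header, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(s, iov, 2);
}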


66 window probe packets
A window probe is sent after a remote TCP has advertised a window of zero bytes for some period of time. The intent of the window probe is to solicit a window update from the receiving TCP. If the receiving TCP advertised a zero window, it means that the receiving application stopped reading data from its end of the connection. This could mean that the remote application is not receiving enough CPU cycles and thus gives the appearance of being hung, or it could mean that the remote application is indeed "hung" for some reason - blocked on some other resource or stuck in a tight loop. HP-UX 11i TCP uses the calculated TCP Retransmission Timeout (RTO) values for its "persist timer" which triggers the zero window probes.

720391 window update packets
This is the number of segments sent which contained window updates.

562952 control packets
These are segments such as SYN, FIN, or RST that are "control" segments.

624402552 packets received
This is the total number of segments received.

222995941 acks (for 3530654798 bytes)
This is how many of the received segments contained ACKs and how many octets (bytes) those ACKs were acknowledging.

354710 duplicate acks
This is the number of duplicate ACKs we have received. If a TCP connection receives three duplicate ACKs in a row, it will trigger a so-called "fast retransmission." While this statistic gives the total number of duplicate ACKs, it does not tell us enough about their distribution to know how often there were three or more in a row. However, one might make some educated guesses based on the "data packets retransmitted" and "retransmission timeout" statistics.

0 acks for unsent data
This should always be zero. If the stack is receiving ACKs for data the stack did not send, it means that something may be seriously wrong - most likely the remote stack thinks it has received data we did not actually send; less likely, this stack has forgotten what data it has sent. In addition to actual bugs in the stack(s), this could also be triggered by some rogue entity spoofing packets in an attempt to hijack a TCP connection. When the local stack receives an ACK for data it does not believe it has sent, it is considered a serious TCP violation and the connection is summarily aborted with a RST segment.

434810727 packets (2860544293 bytes) received in-sequence
This is the total number of segments received in order. Receiving data in order is good - it means that no segments were reordered (on the part of the network) or lost (suggesting other network problems).


As with the sending side, you can compute the average inbound segment size by dividing the octet count by the segment count.

2 completely duplicate packets (2920 bytes)
A completely duplicate segment, while not fatal, does indicate a slight problem with either the network or the remote stack. One of the likely scenarios for a completely duplicate segment is a failure in the remote stack's Round Trip Time (RTT)/Retransmission Timeout (RTO) mechanisms. This could be because the remote stack has set tcp_rexmit_interval_max to too small a value. It could also be that the remote value of tcp_rexmit_interval_min is too small for the accuracy of the stack's RTT/RTO algorithms. Another retransmission failure is related to the fast retransmission algorithm and those duplicate ACKs. If the network is reordering but not losing segments, the arrival of out-of-order data could cause this system to send enough duplicate ACKs to trigger the remote system's fast retransmission algorithm. The remote system then sends a retransmission for data that was not lost, simply delayed by reordering.

59 packets with some dup. data (64111 bytes duped)
Typically, TCP will send as large a segment as it can whenever it retransmits, starting at the sequence number at the beginning of what it perceives to be the first lost data in the sequence space. Sometimes, this retransmitted segment will have more data than the initial transmission, and this can lead to receipt of segments with some duplicate data. This will only happen when the initial sends were "sub-MSS" in size.

32757 out-of-order packets (34818258 bytes)
This is an indication of how badly the network is either dropping or reordering segments from the remote systems. The higher this statistic, the worse shape the network is in.

0 packets (0 bytes) of data after window

3422 window probes
This is the number of times remote systems have sent this system zero window probes. It is a lower bound on the number of times a local application or applications have stopped receiving data from their end of the connection and the socket buffers have filled.

400852116 window update packets
This is the number of segments received which contained window updates.

2758 packets received after close
Segments received after close can be caused by packet reordering and delay in the network, or by an application handshake failure. The handshake failure would be that an application or applications on this system called close() or shutdown(SHUT_RDWR) before the remote system expected it and the remote system was still sending data to this system.


If the cause is application handshake failure, then the applications should be inspected to make sure that the handshake is either repaired or the effects of the handshake will not cause data loss or corruption.

0 segments discarded for bad checksum
On a private network, or intranet, this value should be very, very small. It may be larger for an Internet connected system. Even then, however, the value should be a very small fraction of the total number of TCP segments received.

0 bad TCP segments dropped due to state change
An incoming TCP segment is discarded because it is in an invalid TCP state. This value should be very small.

199570 connection requests
This is the number of "active" opens performed on this system. An "active" open is when an application calls connect(). This system was the one to initiate connection establishment by sending a SYNchronize segment to the remote system(s).

87471 connection accepts
This is the number of "passive" opens performed by this system - that is, the number of times we accept()ed a connection initiated by remote systems.

287041 connections established (including accepts)
This is simply the sum of the two prior statistics.

335394 connections closed (including 48427 drops)
This is the number of connections which have been terminated either through a proper graceful shutdown initiated by the application calling shutdown() or close() (without setting SO_LINGER for an abortive close) - i.e. an exchange of FIN segments - or through an abortive close (the drops) - receipt or transmission of RST segments in response to an SO_LINGER setting or some other protocol violation (a short SO_LINGER sketch appears below). The number of abortive closes (aka drops) should be very much smaller than the number of graceful closes. Ideally, it would be zero.

27589 embryonic connections dropped
This is the number of TCP connections which were closed before they could transition to the ESTABLISHED state. It could mean that this system is trying to establish connections to remote systems which are not listening for connection requests, or are being rejected by firewalls or other means. It could also mean that some remote rogue system or systems are trying to initiate some sort of denial of service attack by flooding the local machine with bogus connection requests.

222735273 segments updated rtt (of 222735273 attempts)
In broad terms, the only segments which can update the RTT (Round Trip Time) are those which are transmitted and then ACKed without a retransmission. If the retransmission rate increases, the ratio of RTT updates to attempts will get worse.
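The abortive closes counted as "drops" above are typically requested by the application through SO_LINGER with a zero linger time. A minimal, hedged sketch (the function name is hypothetical):

    #include <sys/socket.h>

    /* Request an abortive close: a subsequent close() sends a RST
     * rather than a FIN, and any unsent data is discarded. */
    int make_close_abortive(int sock_fd)
    {
        struct linger lg;
        lg.l_onoff  = 1;   /* enable SO_LINGER              */
        lg.l_linger = 0;   /* zero timeout requests a reset */
        return setsockopt(sock_fd, SOL_SOCKET, SO_LINGER,
                          &lg, sizeof(lg));
    }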


6144 retransmit timeouts
This is the number of times TCP's retransmission timer has expired on a connection or connections. When this timer expires a TCP segment will be retransmitted. In general, the Retransmission Timeout (RTO) estimator in HP-UX 11i is rather robust, which means the only time there would be spurious timeouts would be when someone has "tuned" tcp_rexmit_interval_* to bogus values. In cases where the network has a highly variable RTT one might want to set the ndd tunable tcp_smoothed_rtt to a value of one to select an even more robust RTO estimator. If the number of retransmit timeouts is larger than the number of data segments retransmitted, it implies that SYN segments are being retransmitted and the system is having trouble getting connections established. It could be that the remote system(s) have full listen queues and are discarding the SYN segments. It could also be that standalone FIN segments are being retransmitted. If the number of retransmit timeouts is smaller than the number of data segments retransmitted, it implies the difference between the two is roughly the number of "fast retransmissions" sent by this system.

459 connections dropped by rexmit timeout
This is the number of connections where the total time a given segment has remained unACKed has exceeded the value of tcp_ip_abort_interval. It means that either the remote system or network died or was overloaded, or someone has tuned tcp_ip_abort_interval to too small a value.

66 persist timeouts
This is the number of times the persist timer has expired on one or more TCP connections. When the persist timer expires a zero-window probe is sent to help "remind" the remote TCP that we are anxiously awaiting a window update.

32590 keepalive timeouts
27598 keepalive probes sent
11 connections dropped by keepalive
If an application sets SO_KEEPALIVE with setsockopt(), and the connection has been idle (no data flow either way) for tcp_keepalive_interval milliseconds (default two hours) the TCP stack will have a keepalive timeout and may then send TCP keepalive probes to make sure the remote TCP is "still there." The keepalive mechanism will be enabled for any connection, regardless of the setting of SO_KEEPALIVE, once the application calls close() against the socket. In this case, keepalive probes will be sent after tcp_keepalive_detached_interval milliseconds. In either case, if the keepalive probe does not elicit a response from the remote within tcp_ip_abort_interval, the connection will be dropped (aborted) with a RST segment just as if it had been dropped by retransmission timeout.
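As a hedged illustration of the keepalive discussion above (the function name is hypothetical), an application opts a connection into keepalive probing as follows; the probe timing itself is governed by the tcp_keepalive_* ndd tunables:

    #include <sys/socket.h>

    /* Ask TCP to probe this connection when it has been idle. */
    int enable_keepalive(int sock_fd)
    {
        int on = 1;
        return setsockopt(sock_fd, SOL_SOCKET, SO_KEEPALIVE,
                          &on, sizeof(on));
    }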

0 connect requests dropped due to full queue
If the number of connection requests (SYN segments) dropped due to full queue is non-zero, it means that either the setting for tcp_conn_request_max is too small, or one or more applications on the system are using too small a value for the listen backlog in their calls to listen().


It is also possible that those settings are good, but one or more applications have stopped calling accept() against their listen socket(s) - perhaps they are saturated, or perhaps they are caught in an infinite loop. (A short listen() backlog sketch appears below, after the remaining connection statistics.)

1585 connect requests dropped due to no listener
This is the number of connection requests (SYN segments) dropped because there was no local endpoint in the LISTEN state. This could be the result of someone forgetting to start a server application, or configuring the wrong well-known port number at either the server or client. It could also be the result of a "probe" from some remote system looking for a particular service that is not running on the system.

0 suspect connect requests dropped due to aging
This is the number of suspect connection requests dropped because the number of suspect connections exceeded the maximum that will be allowed to persist in the SYN_RCVD state. See the ndd tunable tcp_syn_rcvd_max, which controls the SYN attack defense of TCP. For SYN attack defense to work, tcp_syn_rcvd_max must be large enough so that a legitimate connection will not age out of the list before an ACK is received from the remote host.

0 suspect connect requests dropped due to rate
This indicates that excessive suspect connections have piled up and the remote system has not previously had valid connections established.
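The listen backlog referred to in the "dropped due to full queue" statistic above is supplied by the application. A minimal, hedged sketch follows (the descriptor and backlog value are hypothetical); the effective queue depth is the smaller of this backlog and the tcp_conn_request_max tunable described in Appendix B:

    #include <sys/socket.h>

    #define LISTEN_BACKLOG 1024   /* illustrative value only */

    /* Put a bound TCP socket into the LISTEN state. */
    int start_listening(int listen_fd)
    {
        return listen(listen_fd, LISTEN_BACKLOG);
    }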

UDP:
0 incomplete headers
On an 11i system, this field should have no meaning.

0 bad checksums
A non-zero value for this statistic means that either the network is corrupting data occasionally, or there is a high rate of IP datagram fragments being received and the 16-bit IP datagram ID between this host and one or more other hosts is wrapping quickly enough to generate "Frankengrams." "Frankengrams" are when the fragments of two otherwise unrelated IP datagrams which happened to have the same IP Datagram ID are joined into one datagram - some fragments from one datagram, some fragments from the other. For more on this, see the IP fragmentation tunable discussion in Appendix B.

0 socket overflows
A non-zero value here means that receiving UDP applications are not draining their socket buffers fast enough. It implies that either the receiving apps need to be tuned, or they need to have some form of flow control implemented. It would be best for UDP applications needing larger socket buffer sizes to get them via the setsockopt() socket call. Regardless, it should be noted that simply increasing the size of the socket buffer is only going to work if the length of the overload condition is short and the burst of traffic is less than the size of the socket buffer. Increasing the socket buffer size will not help if the overload is sustained.
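A hedged sketch (the function name is hypothetical) of how a UDP application would request a larger receive buffer with setsockopt(). As noted above, this only absorbs short bursts and cannot cure a sustained overload, and the accepted size is presumably bounded by the udp_recv_hiwater_max parameter mentioned later under socket_udp_rcvbuf_default:

    #include <sys/socket.h>

    /* Request a larger receive buffer for a UDP socket. */
    int grow_udp_rcvbuf(int sock_fd, int bytes)
    {
        return setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
                          &bytes, sizeof(bytes));
    }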


There are two common ways to identify the overflowing UDP socket:
"ndd -get /dev/ip ip_udp_status" lists the UDP fanout table, and one of the columns is overflows. If the UDP port is long-lived, this will tell you which UDP socket is experiencing the overflows.
Collect a network trace during one of the overflows and look for an ICMP SOURCE QUENCH packet. If the SOURCE QUENCH was generated because of a UDP socket overflow, the SOURCE QUENCH packet will tell you which UDP socket is overflowing.

IPv4:
36340838 total packets received
The total number of IP datagrams received by this host.

13 bad IP headers
Bad headers could include IP header checksum failures, or unrecognized options and the like.

11 fragments received
The total number of IP datagram fragments received. This is not enough data to tell how many fragmented IP datagrams were received, however, as there is no data on the distribution of fragments per datagram.

0 fragments dropped (dup or out of space)
An increment in this statistic could mean a possible Frankengram situation when the fragments are reported duped, or simply that the fragments are indeed being duplicated by the network. The "out of space" can arise from packet loss rates and settings on the limits of memory consumed by IP fragment reassembly. (See the ip_reass_mem_limit discussion in Appendix B for guidelines on setting it in some conditions.)

0 fragments dropped after timeout
The only time there should be fragments dropped after timeout is when there is packet loss in the network. In those cases, it is possible that a high-enough rate of IP datagram fragments from a remote system could wrap the 16-bit IP Datagram ID field and result in "Frankengrams" coming out of IP fragment reassembly. This is a fundamental problem with IPv4 and not necessarily a bug in any specific implementation. "Frankengrams" are when the fragments of two otherwise unrelated IP datagrams which happened to have the same IP Datagram ID are joined into one datagram - some fragments from one datagram, some fragments from the other. The only thing that will detect this in the Transport would be the ULP (Upper Layer Protocol - eg UDP or TCP) Internet Checksum. So, one should _NEVER_ disable UDP checksums. Otherwise, it is up to the application to detect it, lest there be silent data corruption. TCP does much to avoid IP fragmentation, and so using TCP instead of UDP would be a decent workaround from that standpoint, but the workaround is not complete - especially if PathMTU discovery is disabled (ip_pmtu_strategy), or TCP is told to not adjust its send sizes based on the ICMP Fragmentation Needed messages generated by PathMTU Discovery (tcp_ignore_path_mtu).


The only real "fix" to this is to migrate to IPv6, where the ID field is much larger.

0 packets forwarded
This is the number of IP datagrams the system has forwarded. If the system is not intended to be an IP router, and this value is non-zero, it means that one or more other systems are using this system as a router, and that someone forgot to set ip_forwarding to zero with ndd on this system. In that case, two configuration problems need to be fixed.

56 packets not forwardable
This is the number of datagrams this system could not forward, either because it was configured to not forward IP datagrams, or because it tried to forward but could not find a route. In the former case, the remote systems trying to use the system as a gateway need to be reconfigured. In the latter, the routing tables need to be updated. The netstat -r command displays the current routing tables and the route(1M) command is used to update these tables. Refer to the route(1M) man page for help on updating the routing tables.

IPv6:
6240733 total packets received
The total number of IPv6 datagrams received by this node.

5 bad IPv6 headers
The total number of bad IPv6 headers or IPv6 packets containing bad IPv6 extension headers received by this node.

3206 fragments received
The total number of IPv6 datagram fragments received.

0 fragments dropped
As with IPv4, an increment in this statistic could mean a possible Frankengram situation when the fragments are reported duped, or simply that the fragments are indeed being duplicated by the network. Fragments may also be dropped as a result of packet loss in the network (preventing reassembly of the original datagram), and as a result of settings on the limits of memory consumed by IPv6 fragment reassembly. (See the ip6_reass_mem_limit discussion in Appendix B for guidelines on setting it in some conditions.)


ICMP:
219220 calls to generate an ICMP error message
215278 ICMP messages dropped
Output histogram:
echo reply: 1134

An "echo reply" will be sent in response to an ICMP Echo Request aka the type of packet traditionally sent by the ping utility. destination unreachable: 2804 source quench: 0 One can use ndd to set ip_send_source_quench to zero, see Appendix B. routing redirect: 0 echo: 0 This is the number of "ping" (aka ICMP Echo) requests the system has sent. time exceeded: 0 This statistic will be incremented whenever the system is acting as a router and the Time To Live (TTL) field in an IP datagram to be routed has reached zero. Increasing values of this statistic can imply two things - first that the system sending the datagrams is using too small a TTL value. The second possibility is that there is a routing loop somewhere in the network. When there is a routing loop, the IP datagrams go around and around, with their TTL's decrementing at each router until they hit zero. parameter problem: 0 time stamp request : 0 time stamp reply: 0 address mask request: 0 address mask reply: 4 0 bad ICMP messages Input histogram: echo reply: 45807 destination unreachable: 4184 Destination unreachable messages come in several flavors. Some imply that there was no application waiting at that port number. Others are related to Path MTU discovery. source quench: 217050 routing redirect: 0 A routing redirect will be received by this system when it tries to use a given gateway, but that gateway believes there is a better way to reach the remote destination.


ICMPv6:
33 calls to generate an ICMPv6 error message
8 ICMPv6 messages dropped
12 ICMPv6 error messages dropped for rate control
In IPv6, in order to limit the bandwidth and forwarding costs incurred by originating ICMPv6 error messages, an IPv6 node limits the rate of ICMPv6 error messages it originates. This situation may occur when a source sending a stream of erroneous packets fails to heed the resulting ICMPv6 error messages. One can use ndd to alter ICMPv6 error message rate control via the ip6_icmp_interval tunable.
Output histogram:
destination unreachable: 8
administratively prohibited: 1
An ICMPv6 destination unreachable message comes in several types. Some imply that there was no application waiting at that port number. Others could be generated when there is no route to the destination, or when the packet has to be forwarded but IPv6 forwarding is not enabled. One can use ndd to alter IPv6 forwarding via the ip6_forwarding tunable.
time exceeded: 3
This statistic will be incremented in two cases. The first is when the system is acting as a router and the Hop-Limit field in an IPv6 datagram to be forwarded has reached zero. Increasing values of this statistic can imply two things - first, that the Hop-Limit value being used by the system sending the datagrams is too small; second, that there is a routing loop somewhere in the network. When there is a routing loop, the IP datagrams go around and around, with their Hop-Limits decrementing at each router until they hit zero. The second case is a fragment reassembly timeout; one can use ndd to alter the fragment reassembly timeout via the ip6_fragment_timeout tunable.
parameter problem: 1
An ICMPv6 parameter problem is generated if an IPv6 node processing a packet finds a problem with a field in the IPv6 header or extension headers such that it cannot complete processing the packet.
packet too big: 1
An ICMPv6 Packet Too Big is sent by a router in response to a packet that it cannot forward because the packet is larger than the MTU of the outgoing link.
echo: 6
echo reply: 19
An echo and an echo reply are ICMPv6 informational messages. Traditionally, an echo is sent using the ping utility.


router solicitation: 4
router advertisements: 0
neighbor solicitation: 48
neighbor advertisement: 20
These are the statistics related to the Neighbor Discovery (ND) protocol.
redirect: 0
An ICMPv6 redirect message is sent by a router to inform a host of a better first-hop node to reach a particular destination.
group query: 0
group response: 72
group reduction: 2
These are the statistics related to the Multicast Listener Discovery protocol.


Appendix B: Annotated output of ndd -h and discussions of the TCP/IP tunables


Proper tuning can result in the operating system more efficiently using network bandwidth and system resources like CPU and memory, and providing more of these resources to applications. In addition, in many cases, network performance can be greatly improved by removing conditions which lead to long protocol-based delays. Network congestion, delayed or lost packets, and the dynamic nature of a network topology itself may be addressed by tunable parameters to obtain the best performance under the given circumstances.

The following section contains an annotated list of the TCP/IP tunables that can be set with ndd and the file /etc/rc.config.d/nddconf. The annotations provide additional information on the processes behind the settings. It is presumed that the reader has some basic knowledge of the workings of TCP, IP, and networking in general. Anyone who is unfamiliar with TCP/IP and related protocols should not be attempting to alter these ndd tunables. Note that not all recommendations are appropriate for all types of workloads and only some of the tunables are annotated.

The ndd -h output shown below is from HP-UX 11i v3. Where a tunable is specific to a particular release, it will be so noted; otherwise, the tunable is present on all releases of 11i. Although the tuning recommendations are provided for the HP-UX 11i v3 release, they can also be extended to previous releases like HP-UX 11i v1 and 11i v2.

This document will not discuss the "syntax" of setting ndd tunables via either ndd or the file /etc/rc.config.d/nddconf. It presumes that the reader is already familiar with the syntactic workings of ndd. See the ndd manpage and documentation available at http://docs.hp.com/ for information on ndd. The tunables are listed in alphabetical order, with the help text output by the command "ndd -h <parameter>" followed by additional discussion and tuning suggestions.

IPv4 Tunables
ip_def_ttl: Sets the default time to live (TTL) in the IP header. [1,255] Default: 255 The TTL field of the IP header is used to ensure that an IP datagram eventually "dies" on the network. Each time an IP datagram goes through an IP router, the TTL is decremented by one hop. When an IP datagram's TTL reaches zero, it is discarded. This can be rather useful in the face of routing loops - without the decrementing TTL datagrams would run through the loop forever. Note that this is the value used for generic IP purposes. The TTL of IP datagrams containing either UDP datagrams or TCP segments is controlled via udp_def_ttl or tcp_ip_ttl respectively. The TTL used for raw IP access is set via rawip_def_ttl. It is unlikely that one would ever need to change this value. One possible case would be when the network administrator wanted to prevent the system from reaching other systems more than N hops away where N was the value selected for ip_def_ttl.


ip_forward_directed_broadcasts: Set to 1 to have IP forward subnet broadcasts. Set to 0 to inhibit forwarding. [0,1] Default: 1 A directed broadcast datagram has the broadcast IP address of a remote IP subnet as its destination IP address. Directed broadcasts will only be forwarded if ip_forward_directed_broadcasts is set to one and ip_forwarding is set to one or two. If either ip_forward_directed_broadcasts or ip_forwarding are set to zero, directed broadcasts will not be forwarded.

ip_forward_src_routed: Set to 1 to forward source-routed packets; set to 0 to disable forwarding. If disabled, an ICMP Destination Unreachable message is sent to the sender of source-routed packets needing to be forwarded. [0,1] Default: 1 In a source-routed datagram, the source of the datagram specifies explicit options that indicate which intermediate hops the datagram must take. Source-routed datagrams will only be forwarded if both ip_forward_src_routed is set to one and ip_forwarding set to one or two. If either ip_forward_src_routed or ip_forwarding are set to zero, no source-routed datagrams will be forwarded. ip_forwarding: Controls how IP hosts forward packets: Set to 0 to inhibit forwarding; set to 1 to always forward; set to 2 to forward only if the number of logical interfaces on the system is 2 or more. [0,2] Default: 2 The ip_forwarding tunable can be thought-of as the "master forwarding switch." If it is set to zero, no IP datagrams will be forwarded outside the box. If it is set to one or two, IP datagrams may be forwarded outside the box. If ip_forwarding is set to zero, the system will still "internally forward" datagrams received on the "wrong" (does not have the matching IP address) interface. If you want to limit the receipt of IP datagrams to only the interface with the matching IP address, you should consider setting ip_strong_es_model.

ip_fragment_timeout: Set the amount of time IP fragments are held while waiting for missing fragments. RFC1122 specifies 60 seconds as the time-out period for reassembly of IP datagrams. This is a long time, but may be appropriate for reassembling datagrams that have traversed an internet. On local file server systems, on the other hand, fragmentation reassembly will either take place very quickly, or not at all; i.e., if all fragments are not received at about the same time, it is likely that one was dropped by the local interface, and will never arrive. In this case, holding fragments for 60 seconds may only exacerbate the problem. If the parameter is set to a value that is large enough that IP wraps packet sequence numbers (IP starts to re-use its sequence numbers) while holding fragments for reassembly, it is possible that IP will assemble a packet with fragments from different packets. In this case, the problem will be detected only if the upper-layer is validating data integrity (using checksums). With a 10 MBit/second link and a 1500-byte MTU, IP sequence numbers may wrap within approximately 80 seconds. With a 100Mbit/second link, IP sequence numbers may wrap within approximately 8 seconds, and with a Gigabit Ethernet link, IP sequence numbers may wrap within approximately 0.8 seconds. This parameter is specified in milliseconds. The actual value used is rounded to the nearest second. [100, - ] Default: 60000 (60 seconds)

One of the IP datagram header fields used to select which fragments go together is the IP datagram ID field. This is an unsigned 16-bit quantity. It generally does not take very long to "wrap" an unsigned 16-bit quantity as there are only 65536 distinct values of a 16-bit counter. For example, it is possible for a fragment of one IP datagram with ID "one" to arrive, but the other fragments to be lost. The retransmission mechanisms of the upper layer protocols (TCP, UDP, etc) will retransmit. If another unrelated IP datagram fragment with the same ID (and other values such as protocol ID and source and destination IP address) arrives later, before the ip_fragment_timeout, a "Frankengram" IP datagram could result, built from parts of otherwise unrelated IP datagrams. At this point, only upper-layer protocol checksums can prevent undetected data corruption. Another consideration about this tunable is that if it is set too high and the network is lossy, the fragmentation memory can get used up very quickly. While this memory is waiting to timeout, good fragments cannot be accepted. See the discussion of ip_reass_mem_limit.
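As a rough, hedged check of the wrap figures quoted in the help text above: a 16-bit ID space allows 65536 distinct datagram IDs, and at a 1500-byte MTU that corresponds to roughly 65536 * 1500 * 8 = 786 million bits of traffic. A 10 Mbit/second link carries that in about 79 seconds, a 100 Mbit/second link in about 8 seconds, and a Gigabit Ethernet link in under a second, which is where the 80 / 8 / 0.8 second figures come from.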

ip_icmp_return_data_bytes: The maximum number of data bytes to return in ICMP error messages. [8,65536]. Default: 64 bytes

ip_ill_status: Display a report of all allocated physical interfaces. This will display interfaces IP believes to be "physical" - the interfaces with IP index ":0", such as lan1:0 (which is the same as lan1; see the ifconfig(1m) manpage for more information). Note that an interface with IP index ":0" may not actually be a true physical interface - it could be a virtual interface created by Auto Port Aggregation (APA). Since link trunking is done without the knowledge of IP, IP cannot, nor does it need to, distinguish between a "true" physical interface and a virtual interface created through trunking. The interface could also be a "vlan" interface, which is something of the inverse of an APA trunk - instead of aggregating multiple NICs together, VLANs subdivides a physical NIC. Virtual interfaces such as tunnels and PPP devices would also not be displayed.


ip_ipif_status: Display a report of all allocated logical interfaces.
A logical interface is created whenever one adds IP addresses to the one already assigned to a "physical" interface. These are also sometimes referred to as aliased addresses and are given names such as lan0:1, lan0:2 and so on. These exist only in the "mind" of IP. Neither the NIC driver nor DLPI knows of their existence. While a physical/virtual interface (see the discussion of ip_ill_status) will appear in the output of lanscan(1m), a logical interface will not. It is possible to have a virtual interface created with Auto Port Aggregation treated by IP as a physical interface on which there are several additional logical interfaces. In the future, that "virtual" interface could be a VLAN operating off the same APA aggregate.

ip_ire_hash: Display a report of all routing table entries, in the order searched when resolving an IP address.
The Internet Routing Entry (IRE) is the primary data structure that links IP addresses with particular interfaces, attached networks, gateways, and local and remote hosts. The corresponding data structure for IPv6 is an IRE6.

ip_ire_status: Display a report of all routing table entries. Same information as in ip_ire_hash, but format and ordering are different.

ip_ire_cleanup_interval: Sets the time-out interval for purging routing table entries. All entries unused for this period of time are deleted. [5000, - ] Default: 300000 (5 minutes)

ip_ire_flush_interval: All routing table entries are deleted after this amount of time, even those which have been recently used. [60000, -] Default: 1200000 (20 minutes)
Host routes associated with the IRE_ROUTE flag from the "ndd -get /dev/ip ip_ire_status" output will be deleted. A host route with the IRE_ROUTE flag must exist for any non-local destination before an IP datagram can be sent to that address. The IRE_ROUTE host routes are created internally in IP for all non-local destinations to improve routing lookup. None of the routes from the netstat -rn output will be deleted.

ip_ire_gw_probe: Enable dead gateway probes. This option should only be disabled on networks containing gateways which do not respond to ICMP echo requests (ping). [0-1] Default: 1 (probe for dead gateways)


ip_ire_gw_probe_interval: Controls the probe interval for Dead Gateway Detection. IP periodically probes active and dead gateways. ip_ire_gw_probe_interval controls the frequency of probing. With retries, the maximum time to detect a dead gateway is ip_ire_gw_probe_interval + 10000 milliseconds. Maximum time to detect that a dead gateway has come back to life is ip_ire_gw_probe_interval. [15000,- ] Default: 180000 (3 minutes) A gateway probe is an ICMP echo request sent by the host to the gateway's IP address. If the gateway has been configured to ignore ICMP echo requests, a host will mistakenly think the gateway is dead, and traffic will stop flowing from the host through the gateway, unless gateway probing is disabled by setting ip_ire_gw_probe to zero. ip_ire_pathmtu_interval: Every ip_ire_pathmtu_interval milliseconds, IP will scan its routing table for entries that have an MTU less than the MTU for the first hop interface. For each, it will increase the value to the next highest value in its internal table of common MTU sizes. In this way, if the path to a remote host has changed, and a larger MTU is now usable, the new MTU will be discovered. If this value is made too small, then excessive lost packets can result. [5000, - ] Default: 600000 (10 minutes)

ip_ire_redirect_interval: All routing table entries resulting from ICMP "Redirect" messages are deleted after this much time has elapsed, whether or not the entry has been recently used. [60000, - ] Default: 300000 (5 minutes) An ICMP "Redirect" is sent to a host when it uses a gateway that believes there is a "better" way for that host to send IP datagrams to their final destination. This occurs when one relies on default routes and there are multiple routers on the local subnet, each going to different destinations. ip_pmtu_strategy: Set the Path MTU Discovery strategy: 0 Disables Path MTU Discovery. For any destination not directly connected to the host, a maximum MTU of 576 is used; 1 Enables Path MTU Discovery; 2 Obsoleted, must not be used; 3 Disables Path MTU Discovery. For any destination not directly connected to the host, the maximum MTU of the link is used. When Path MTU Discovery is enabled all outbound datagrams have the "Don't Fragment" bit set. This should result in notification from any intervening gateway that needs to forward a datagram down a path that would require additional fragmentation. When the ICMP "Fragmentation Needed" message is received, IP updates its MTU for the remote host. If the responding gateway implements the


recommendations for gateways in RFC 1191, then the next hop MTU will be included in the "Fragmentation Needed" message, and IP will use it. If the gateway does not provide next hop information, then IP will reduce the MTU to the next lower value taken from a table of "popular" media MTUs. [0,3] Default: 1 Setting the value to one will mean that IP datagrams will always have the DF bit set. This is generally fine, but there are still some broken setups out there that will filter-out ICMP "Fragmentation Needed" messages. Trying to send IP datagrams with the DF bit set through such setups will create a "black hole" beyond which systems are unreachable. Setting the value to zero will mean that TCP will have to fall-back on other strategies to ensure that its segments are not fragmented along the path to the destination. This could result in TCP using a Maximum Segment Size (MSS) smaller than the maximum possible along that path. This can lead to decreased performance. Setting the value to 3 will result in the DF bit in the IP header being cleared, but will still have TCP select an MSS based on the link-local MTU. In effect, it is a way for the network administrator to tell the transport that all (sub)nets are local or that the network administrator is not at all concerned if traffic from this host happens to become fragmented along the way.

ip_reass_mem_limit: Sets an upper bound on the number of bytes IP will use for packet reassembly. If the limit is reached, reassembly lists are purged until the space required for the new fragments becomes available. [-,-] Default: 2000000 bytes It is rare that this value should need to be changed. Most of the time, the successive fragments of an IP datagram will arrive fairly close to one another, and will not sit very long in the reassembly queue. However, if "netstat -p ip" shows a large number of IP datagram fragments being dropped as dup's or for lack of space, you might consider increasing this value. Before you do, check the frequency of such drops so you can compare it to after you make the change. If the frequency remains the same after you have increased the value, it implies that the IP fragment drops were the result of perceived duplicates and not because there was not enough space. If you see a large number of IP datagram fragment drops in the output of "netstat -p ip" you might also want to check the frequency of UDP and TCP checksum failures with netstat -p udp and netstat -p tcp respectively. There are "limitations" in IP version 4 which can lead to reuse of the IP datagram Identifier before fragments can timeout in the reassembly area. This can mean that two otherwise unrelated, fragmented IP datagrams could be reassembled into what one might call a "Frankengram" which will only be detected by the upperlayer checksum. The best course of action should this be happening would be to try to eliminate the fragmentation. The next best course of action would be to try to further restrict the amount of time the system will "hang-on" to an IP datagram fragment. For more discussion on this topic, read the discussion of the ip_fragment_timeout tunable.

ip_send_redirects: Set to 1 to allow IP send ICMP "Redirect" packets; set to 0 to disable. [0,1] Default: 1 (enable)


If the system is forwarding IP datagrams, and it is asked to forward a datagram when it knows there is a "better" route, it will send an ICMP "Redirect" message to the source of the datagram to tell it the better route. The system will still forward the datagram. If the value of ip_send_redirects is set to zero, the system will still forward the wayward datagram, but it will not tell the source that there is a better way for it to send its datagrams. ip_send_source_quench: Set to 1 to allow IP send ICMP "Source Quench" packets when it encounters upstream flow control; set to 0 to disable. [0,1] Default:1 (enable)

ip_strong_es_model: Controls the requirement issues related to multihoming as described in RFC1122, Section 3.3.4.2:
(A) A host MAY silently discard an incoming datagram whose destination address does not correspond to the physical interface through which it is received.
(B) A host MAY restrict itself to sending (non-source-routed) IP datagrams only through the physical interface that corresponds to the IP source address of the datagrams.

When set to 0, it corresponds to the "Weak ES Model" and would therefore substitute MUST NOT for MAY in issues (A) and (B). When set to 1, it corresponds to the "Strong ES Model" and would therefore substitute MUST for MAY in issues (A) and (B). When set to 2, it would substitute MUST NOT for MAY in issue (A) and SHOULD for MAY in issue (B). [0,2] Default: 0 Setting this value to one (1) will have the beneficial effect of allowing (should they be desired) per-interface default routes. It also means that if a packet is received on a given interface, the reply to that packet will be sent-out that interface. This can be useful if one is in the rare situation of needing to have separate physical (in the context of IP - see ip_ill_status) interfaces configured with IP addresses in the same subnet. Generally though, using Auto Port Aggregation (APA) to create one virtual interface with a logical interface for each address is a more robust solution. Also, when ip_strong_es_model is set to a value of one (1), IP datagrams arriving on the "wrong" interface (one that does not have an IP address which matches the IP datagrams' destination IP address) are discarded. If one is using IP address aliases on the loopback (lo0) interface in support of functionality such as hardware load balancer triangle routing, setting ip_strong_es_model to a value of one may result in loss of connectivity for the virtual IP address. Setting this value to two (2) will also give the effect of allowing per-interface default routes. Another feature is that the system will try to send out the best matching interface, like when setting to one (1), but allow the packet to come in on any interface, like when setting to zero (0). This could result in better link utilization which could result in better system performance.


IPv6 Tunables
ip6_def_hop_limit: Sets the default value of the Hop Limit field in the IPv6 header. [1,255] Default: 64 The Hop-Limit field of the IPv6 header is used to ensure that an IPv6 datagram eventually "dies" on the network. Each time an IPv6 datagram goes through an IPv6 router, the Hop-Limit is decremented by one hop. When an IPv6 datagram's Hop-Limit reaches zero, it is discarded. This can be rather useful in the face of routing loops - without the decrementing Hop-Limit datagrams would run through the loop forever. Note that this is the value used for generic IPv6 purposes, for example when generating a response to ICMPv6 echo request. The Hop-Limit of IPv6 datagrams containing either UDP datagrams or TCP segments is controlled via udp_def_hop_limit or tcp_ip6_hop_limit respectively. The Hop-Limit used for raw IPv6 access is set via rawip6_def_hop_limit.

ip6_forwarding: Controls how IPv6 hosts forward packets: Set to 0 to inhibit forwarding; set to 1 to always forward when the number of interfaces on the system is 2 or more. [0,1] Default: 1 The ip6_forwarding tunable can be thought-of as the "master forwarding switch." If it is set to zero, no IPv6 datagrams will be forwarded outside the box. If it is set to one, IPv6 datagrams may be forwarded outside the box. The IPv6 forwarding can be controlled per-interface as well (see ifconfig(1M) man page for forward option).

ip6_fragment_timeout: Set the amount of time IPv6 fragments are held while waiting for missing fragments. The IPv6 specification (RFC 2460) specifies 60 seconds as the time-out period for reassembly of IPv6 datagrams. This is a long time, but may be appropriate for reassembling datagrams that have traversed an internet. On local file server systems, on the other hand, fragmentation reassembly will either take place very quickly, or not at all; i.e., if all fragments are not received at about the same time, it is likely that one was dropped by the local interface, and will never arrive. In this case, holding fragments for 60 seconds may only exacerbate the problem. For every IPv6 packet that is to be fragmented, the source node generates an Identification value that is carried in each of that packet's fragments. The Identification must be different than that of any other fragmented packet sent recently with the same Source Address and Destination Address. If insufficient fragments are received to complete reassembly of a packet within 60 seconds of the reception of the first-arriving fragment of that packet, reassembly of that packet must be abandoned and all the fragments that have been received for that packet must be discarded. If the parameter is set to a value that is large enough that IPv6 wraps the identification value in its fragments (IPv6 starts to re-use its identification value while holding fragments for reassembly), it is possible that IPv6 will assemble a packet with fragments from different packets. In this case, the problem will be detected only if the upper-layer is validating data integrity (using checksums). The fragment identifier field in IPv6 is 32 bits, allowing a much larger range of values than the 16-bit field in IPv4, and greatly reducing the possibility of wrap within the reassembly timeout interval. With a Gigabit Ethernet link, IPv6 fragment identifiers may wrap within approximately 52000 seconds. With a 10 Gigabit Ethernet link, IPv6 fragment identifiers may wrap within 5200 seconds. [100, - ] Default: 60000 (60 seconds)
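As a rough, hedged check of those figures: a 32-bit identifier space allows about 4.29 billion values, and at a 1500-byte MTU that is roughly 4294967296 * 1500 * 8 = 5.2e13 bits of traffic, which takes on the order of 51,000 seconds to send at 1 Gbit/second and roughly 5,200 seconds at 10 Gbit/second.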

ip6_icmp_interval: Limit the bandwidth and forwarding costs incurred sending ICMPv6 error messages. This situation may occur when a source sending a stream of erroneous packets fails to heed the resulting ICMPv6 error messages. [0, 10000] Default: 100ms
This tunable limits the rate of ICMPv6 error messages that the kernel generates.

ip6_ill_status: Displays a report of all IPv6 allocated physical interfaces.
This will display interfaces IPv6 believes to be "physical" - basically, the ":0" interfaces (see the manpage for ifconfig(1m) for more) such as lan0:0, which is the same as lan0. However, such an interface may not actually be a true physical interface - it could be a virtual interface created by Auto Port Aggregation (APA). Since link trunking is done without the knowledge of IPv6, IPv6 cannot, nor does it need to, distinguish between a "true" physical interface and a virtual interface created through trunking.

ip6_ipif_status: Displays a report of all allocated logical interfaces.

A logical interface is created whenever one adds IPv6 addresses to the one already assigned to a "physical" interface, or via the IPv6 auto-configuration mechanism. These are also sometimes referred to as aliased addresses and are given names such as lan0:1, lan0:2 and so on. These exist only at the IPv6 level; neither the NIC driver nor DLPI knows of their existence. While a physical/virtual interface (see the discussion of ip6_ill_status) will appear in the output of lanscan(1M), a logical interface will not.

ip6_ire_cleanup_interval: Sets the time-out interval for purging IPv6 routing table entries. All entries unused for this period of time are deleted. [5000, - ] Default: 300000 (5 minutes)
Host routes associated with the IRE6_ROUTE flag from the "ndd -get /dev/ip6 ip6_ire_status" output will be deleted. A host route with the IRE6_ROUTE flag must exist for any non-local destination before an IPv6 datagram can be sent to that address. The IRE6_ROUTE host routes are created internally in IPv6 for all non-local destinations to improve routing lookup. None of the routes from the netstat -rn output will be deleted.

ip6_ire_hash: Displays a report of all routing table entries, in the order searched when resolving an IPv6 address.
The IPv6 Internet Routing Entry (IRE6) is the primary data structure that links IPv6 addresses with particular interfaces, attached networks, gateways, and local and remote hosts. The corresponding data structure for IPv4 is an IRE.

ip6_ire_pathmtu_interval: Every "ip6_ire_pathmtu_interval milliseconds", IPv6 will scan its routing table for entries that have an MTU less than the MTU for the first hop interface. For each, it will increase the value to the next highest value in its internal table of common MTU sizes. In this way, if the path to a remote host has changed, and a larger MTU is now usable, the new MTU will be discovered. If this value is made too small, then excessive lost packets can result. [5000, -] Default: 600000 (10 minutes)

ip6_ire_redirect_interval: All routing table entries resulting from ICMPv6 "Redirect" messages are deleted after this much time has elapsed, whether or not the entry has been recently used. An ICMPv6 "Redirect" is sent to a host when it uses a gateway that believes there is a "better" way for that host to send IPv6 datagrams to their final destination. This can come about when one relies on default routes and there are multiple routers on the local network, each going to different destinations. [5000, - ] Default: 300000 (5 minutes)

ip6_ire_status: Displays a report of all IPv6 routing table entries. Same information as in ip6_ire_hash, but the format and ordering are different.

ip6_reass_mem_limit: Sets an upper bound on the number of bytes IPv6 will use for packet reassembly. If the limit is reached, reassembly lists are purged until the space required for the new fragments becomes available. It is rare that this value should need to be changed. Most of the time, the successive fragments of an IP datagram will arrive fairly close to one another, and will not sit very long in the reassembly queue. However, if "netstat -p ipv6" shows a large number of IPv6 datagram fragments being dropped, you might consider increasing this value. Before you do, check the frequency of such drops so you can compare it to after you make the change. If the frequency remains the same after you have increased the value, it implies that the IPv6 fragment drops were the result of perceived duplicates and not because there was not enough space. If you see a large number of IPv6 datagram fragment drops in the output of "netstat -p ipv6" you might also want to check the frequency of UDP and TCP checksum failures with "netstat -p udp" and "netstat -p tcp" respectively. This can mean that two otherwise unrelated, fragmented IPv6 datagrams could be reassembled into one packet which will only be detected by the upper-layer checksum. The best course of action should this be happening would be to try to eliminate the fragmentation. The next best course of action would be to try to further restrict the amount of time the system will "hang-on" to an IPv6 datagram fragment. For more on that, refer to the discussion of "ip6_fragment_timeout". [-,-] Default: 2000000 bytes
This tunable is the IPv6 counterpart of ip_reass_mem_limit; please refer to ip_reass_mem_limit for more discussion.

ip6_send_redirects: Set to 1 to allow IPv6 send ICMPv6 "Redirect" packets; set to 0 to disable. If the system is forwarding IPv6 datagrams, and it is asked to forward a datagram when it knows there is a "better" route, it will send an ICMPv6 "Redirect" message to the source of the datagram to tell it the better route. The system will still forward the datagram. If the value of ip6_send_redirects is set to zero, the system will still forward the wayward datagram, but it will not tell the source that there is a better way for it to send its datagrams. [0,1] Default: 1 (enable) Only IPv6 routers send ICMPv6 redirect message. An ICMPv6 redirect message is used by the routers to inform hosts of a better first hop for a destination.

ip6_tcp_status: Obtains a complete report similar to "netstat -an" on all IPv6 TCP connections.

ip6_udp_status: Obtains a complete report similar to "netstat -an" on all IPv6 UDP connections.

ip6_ire_reachable_interval: The Neighbor Discovery constant REACHABLE_TIME, Section 10 of RFC 2461. This is the base value for computing the random Reachable value. Used with "ip6_min_random_factor" and "ip6_max_random_factor"


to set the random timer interval for flushing neighbor cache entries. The Reachable Time is the time a neighbor is considered reachable after receiving a reachability confirmation. The Reachable time value is a uniformly-distributed random value between "ip6_min_random_factor" and "ip6_max_random_factor" times "ip6_ire_reachable_interval" milliseconds. [5000, -] Default: 30000 (30 sec)
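As a hedged illustration (assuming the random factors are expressed in hundredths, as their defaults of 50 and 150 suggest): with the default ip6_ire_reachable_interval of 30000 milliseconds, the Reachable Time would be drawn uniformly from the range 0.5 * 30 = 15 seconds to 1.5 * 30 = 45 seconds.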

It is rare that the following IPv6 values should need to be changed. They are related to IPv6 autoconfiguration and the default values are set based on the standard. ip6_max_random_factor: The Neighbor Discovery constant MAX_RANDOM_FACTOR, Section 10 of RFC 2461. For more on this tunable, refer to the discussion of "ip6_ire_reachable_interval". [100, 200] Default: 150

ip6_min_random_factor: The Neighbor Discovery constant MIN_RANDOM_FACTOR, Section 10 of RFC 2461. For more on this tunable, refer to the discussion of "ip6_ire_reachable_interval" [0, 100] Default: 50

ip6_nd_advertise_count: The Neighbor Discovery constant MAX_NEIGHBOR_ADVERTISEMENT, Section 10 of RFC 2461. This tunable specifies how many neighbor advertisements can be sent. For example, a node may determine that its link-layer address has changed and wish to inform its neighbors of the new link-layer address quickly. In such a case the node may send up to "ip6_nd_advertise_count" unsolicited Neighbor Advertisement messages to the all-nodes multicast address. These advertisements are separated by at least "ip6_nd_transmit_interval" seconds. [1, 10] Default: 3 transmissions

ip6_nd_dad_solicit_count: The number of duplicate address detection solicitations that are sent. [0, 10] Default: 1 transmission

ip6_nd_multicast_solicit_count: The Neighbor Discovery constant MAX_MULTICAST_SOLICIT, Section 10 of RFC 2461. If no Neighbor Advertisement is received after "ip6_nd_multicast_solicit_count" solicitations, address resolution has failed. The host will return ICMP destination unreachable indications with code 3 (Address Unreachable) for each packet queued awaiting address resolution. [1, 10] Default: 3 transmissions

ip6_nd_probe_delay: The Neighbor Discovery constant DELAY_FIRST_PROBE_TIME, Section 10 of RFC 2461. The first time a node sends a packet to a neighbor whose entry is STALE, the sender changes the state to DELAY and sets a timer to


expire in "ip6_nd_probe_delay" seconds. If the entry is still in the DELAY state when the timer expires, the entry's state changes to PROBE. If reachability confirmation is received, the entry's state changes to REACHABLE. [5000, -] Default: 5000 (5 sec)

ip6_nd_transmit_interval: The Neighbor Discovery constant RETRANS_TIMER, Section 10 of RFC 2461. This tunable specifies the time between retransmissions of Neighbor Solicitation messages to a neighbor when resolving the address or when probing the reachability of a neighbor. [1000, -] Default: 1000 (1 sec)

ip6_nd_unicast_solicit_count: The Neighbor Discovery constant MAX_UNICAST_SOLICIT, Section 10 of RFC 2461. A node sends a unicast Neighbor Solicitation message to the neighbor using the cached link-layer address. While in the PROBE state, a node retransmits Neighbor Solicitation messages every "ip6_nd_transmit_interval" milliseconds until reachability confirmation is obtained. Probes are retransmitted even if no additional packets are sent to the neighbor. If no response is received after waiting "ip6_nd_transmit_interval" milliseconds after sending the "ip6_nd_unicast_solicit_count" solicitations, retransmissions cease and the entry will be deleted. Subsequent traffic to that neighbor will recreate the entry and performs address resolution again. [1, 10] Default: 3 transmissions

ip6_rd_solicit_count: The Neighbor Discovery constant MAX_RTR_SOLICITATIONS, Section 10 of RFC 2461. When an interface becomes enabled, it may not be desirable to just wait for the next unsolicited Router Advertisement to locate default routers or learn prefixes. In order to obtain Router Advertisements quickly, a host can transmit up to "ip6_rd_solicit_count" Router Solicitation messages each separated by at least "ip6_rd_transmit_interval" seconds. [1, 10] Default: 3 transmissions ip6_rd_solicit_delay: The Neighbor Discovery constant MAX_RTR_SOLICITATION_DELAY, Section 10 of RFC 2461. Before a host sends an initial solicitation, it delays the transmission for a random amount of time between 0 and "ip6_rd_solicit_delay". This serves to alleviate congestion when many hosts start up on a link at the same time, such as might happen after recovery from a power failure. [1000, -] Default: 1000 (1 sec)

ip6_rd_transmit_interval: The Neighbor Discovery constant RTR_SOLICITATION_INTERVAL, Section 10 of RFC 2461. This tunable specifies the interval between Router Solicitation messages. See the related tunable "ip6_rd_solicit_count" for a more detailed discussion. [1000, -] Default: 4000 (4 sec)


Sockets Tunables
socket_buf_max: Specifies the maximum socket buffer size for AF_UNIX sockets. [1024,2147483647] Default: 262144 bytes

socket_caching_tcp: Enables or disables socket caching for TCP sockets for AF_INET and AF_INET6 address families. This value determines how many data structures for TCP sockets the system caches per CPU for each address family. Enabling this feature can improve system performance considerably if the system uses many short-lived connections. The value of 0 (zero) disables the feature. The value of 1 enables the feature and sets the cache size to 512 entries per address family per CPU, which is also the default value. Any value greater than 1 sets the cache size to the specified value. [0-2147483647] Default: 512 (caching)
The HP-UX 11i transport builds a "stream" for each connection. On that "stream" are placed STREAMS modules for a stream-head, TCP and IP, as well as socket data structures (when BSD/XOPEN sockets are in use). These structures can be cached to accelerate connection establishment. HP-UX 11i v1 and 11i v2 support caching of IPv4 TCP connections only. HP-UX 11i v3 has been enhanced such that both IPv4 and IPv6 TCP connections are cached. For virtually all situations, the default value of 512 is sufficient. Refer to section 4.2.2 for the detailed discussion of socket caching for TCP connections.

socket_enable_tops: Controls the IP optimization feature named TOPS (Thread-Optimized Packet Scheduling). Modifying TOPS can result in improved scalability of TCP and UDP socket applications on multiprocessor systems. [0-2] Default: 1 (TOPS Mode 1) for 11i v1 and 11i v2; Default: 2 (TOPS Mode 2) for 11i v3
The behavior depends on the value socket_enable_tops is set to:
0: Disables the TOPS feature.
1: Enables the TOPS feature in Mode 1. This setting can help improve systems using a mix of short and long-lived TCP socket connections.
2: Enables the TOPS feature in Mode 2. This setting is the default value for 11i v3 and is optimal for most system configurations as well as systems using infrequent long-lived connections.
Refer to section 3.1 for the detailed discussion of TOPS.
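The socket-level tunables in this group are also managed with ndd. The following is a sketch only, assuming (as on recent HP-UX 11i releases) that the socket_* parameters are exposed through the /dev/sockets device:

    ndd -get /dev/sockets socket_caching_tcp
    ndd -set /dev/sockets socket_enable_tops 2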


socket_qlimit_max: Sets maximum number of connection requests for non-AF_INET sockets. [1-2147483647] Default: 4096

socket_udp_rcvbuf_default: Sets the default receive buffer size for UDP sockets. The value of this tunable parameter should not exceed the value of the ndd parameter udp_recv_hiwater_max; otherwise a socket() call to create a UDP socket will fail and return the errno value EINVAL. [1-2147483647] Default: 65535

socket_udp_sndbuf_default: Sets the default send buffer size for UDP sockets. [1-65535] Default: 65535

TCP Tunables
tcp_conn_request_max: Maximum number of outstanding inbound connection requests. [1, - ] Default: 4096 connections
This is also known as the maximum depth of the "listen queue." The actual maximum for any given TCP endpoint in the LISTEN state will be the MINIMUM of tcp_conn_request_max and the value the application passed in to the listen() socket call. For this parameter to take effect for a given application, it must be set before said application makes its call to listen(). So, if you use ndd to set this value after the application has started, it will have no effect unless you can get the application to recreate its LISTEN endpoint(s), typically by stopping and restarting the application. You can see if tcp_conn_request_max might be too small by looking at the output of either netstat -s or netstat -p tcp and looking for the line displaying the number of connection requests dropped due to full queue. If the number of drops is zero, the value of tcp_conn_request_max is fine. If the value is non-zero, either tcp_conn_request_max is too small, or the values the applications are using in their calls to listen() are too small. Setting this value higher has no particular downside, especially for any "internet server" system.
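A minimal check-and-tune sequence might look like the following. The exact wording of the netstat counter varies by release, and the value 8192 is only an example; remember that the listening application must be restarted to pick up a larger backlog:

    netstat -p tcp | grep -i "full queue"
    ndd -get /dev/tcp tcp_conn_request_max
    ndd -set /dev/tcp tcp_conn_request_max 8192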


tcp_cwnd_initial: RFC 2414 defines a formula for calculating the sender's initial congestion window that usually results in a larger window than in previous releases. The default initial congestion window is now calculated using the following formula: min((4 * MSS), max(2 * MSS, 4380)) where MSS is the maximum segment size for the underlying link. With the new congestion window formula, it is possible for TCP to send a large, initial block of data without waiting for acknowledgements. This is useful in networks with large bandwidth and low error rates and particularly useful for short-lived connections that only need to send ~4 Kbytes of data or less. To modify the initial congestion window, configure the ndd TCP parameter tcp_cwnd_initial. TCP will calculate the initial congestion window using the following formula: min((tcp_cwnd_initial * MSS), max(2 * MSS, 4380)). [1-4] Default: 4 (TCP implements RFC 2414)
The congestion window is a window different from the "classic" TCP window. The classic TCP window is advertised by the receiving TCP and is communicated in the TCP header. It implements flow control between the sending and receiving TCPs. The congestion window is a "window" tracked by the sending TCP and it is not communicated in the TCP header. It is used by the sending TCP to implement a form of flow control between itself and the network. The sending TCP keeps track of how many segments/bytes it can have outstanding on the network before triggering packet loss. At the beginning of a connection a sending TCP does not know how many segments/bytes it can have outstanding, so in the "be conservative in what you send" Internet philosophy it traditionally starts with a congestion window of two segments. However, even the initial value of two can be rather limiting for connections which need only send a small quantity of data (e.g. three or four segments' worth, say a web server). This parameter takes values of between one (1) and four (4) segments, with a default of four (4). At the beginning of a connection, the sending TCP will send no more than (tcp_cwnd_initial * MSS) or 4380 bytes, whichever is less. After that, it will stop and await the first ACKs from the remote before sending additional data (if any).
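As a worked example (assuming a standard Ethernet MSS of 1460 bytes): min((4 * 1460), max(2 * 1460, 4380)) = min(5840, 4380) = 4380 bytes, so at most three full-sized segments can be sent before the first ACK arrives. Lowering tcp_cwnd_initial to 1 would reduce that first burst to min(1460, 4380) = 1460 bytes.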

tcp_deferred_ack_max: Upper limit on the number of bytes of data (measured in MSS) that can be received without sending an ACK. [2-32] Default: 22 MSS TCP follows an ACK strategy which is an extension of the "every other segment" approach described in RFC 1122. TCP will defer sending ACKs until it has received ack_cnt >= ack_cur_max bytes. Initially ack_cur_max is equal to 1 MSS. Each time TCP is successfully able to defer sending ACKs for ack_cur_max bytes (that is, no timer expired for unacknowledged data), TCP increases ack_cur_max by 1 MSS, up to an absolute maximum of ack_abs_max. The tunable parameter tcp_deferred_ack_max sets the upper bound for the delay before an ACK is sent.


The other events that control ACK generation are the following:
- Arrival of an out-of-order segment.
- Time to send a window update because the application has consumed enough data.
Using delayed ACKs has very little impact on transmission reliability since ACKs are cumulative. Furthermore, delayed ACKs conserve resources by decreasing the load on the network and on the CPU that must generate and process these ACK segments. Delayed ACKs have been found to have a positive impact on the performance of bulk transfers.

tcp_delay_final_twh_ack: Piggyback the final ACK of the three-way TCP connection establishment with data by delaying the final ACK by 10 ms.
0: Send the final ACK as soon as the SYN+ACK packet arrives from the remote host.
1: Delay the sending of the final ACK by 10 ms. If there is data available to be sent within the next 10 ms, piggyback the ACK for the SYN with that data.
[0-1] Default: 1

tcp_early_conn_ind: If set to 1, a T_CONN_IND message is sent upstream as soon as a SYN packet is received from a remote host which is on the TCP's 'good guy' list. A remote host goes on the 'good guy' list if it is known to have completed the 3-way handshake earlier. If set to 0, the T_CONN_IND message is not sent upstream until the 3-way handshake is complete, even if the remote host is on the 'good guy' list. [0,1] Default: 1

tcp_fin_wait_2_timeout: Determines how long a TCP connection will be in FIN_WAIT_2. Normally one end of a connection initiates the close of its end of the connection (indicates that it has no more data to send) by sending a FIN. When the remote TCP acknowledges the FIN, TCP goes to the FIN_WAIT_2 state and will remain in that state until the remote TCP sends a FIN. If the FIN_WAIT_2 timer is used, TCP will close the connection when it has remained in the FIN_WAIT_2 state for the length of the timer value. The FIN_WAIT_2 timer must be used with caution because when TCP is in the FIN_WAIT_2 state the remote is still allowed to send data. In addition, if the remote TCP would terminate normally (it is not hung nor terminating abnormally) and the connection is closed because of the FIN_WAIT_2 timer, the connection may be closed prematurely. Data may be lost if the remote sends a window update or FIN after the local TCP has closed the connection. In this situation, the local TCP will send a RESET.


According to the TCP protocol specification, the remote TCP should flush its receive queue when it receives the RESET. This may cause data to be lost. [0-2147483647 Milliseconds] Default: 0 (indefinite)

The tcp_fin_wait_2_timeout parameter controls a timer that can be used to terminate connections in the FIN_WAIT_2 state. It should only be used in those cases where the tcp_keepalive_detached_interval mechanism is known not to work. This is almost always the result of something like a broken firewall somewhere in the network. Such broken components should be fixed. If, and only if, that is not possible, setting tcp_fin_wait_2_timeout can be used as a workaround until the broken components can be fixed. The default value of 0 means that no additional timer is started. A value greater than 0 will cause a timer of tcp_fin_wait_2_timeout milliseconds to be started once the connection enters the FIN_WAIT_2 state. This timer is only activated for the FIN_WAIT_2 state. Other states are covered either by an application setting SO_KEEPALIVE and the tcp_keepalive_interval, or by the TCP retransmission parameters. The timer is only active on connections that go to FIN_WAIT_2 after the ndd command has been run. Any connection that is already in the FIN_WAIT_2 state will not be affected. An example of applying this workaround is shown after the next tunable.

tcp_frto_enable: Allows the use of the Forward RTO Recovery (F-RTO) algorithm. This attempts to detect when a spurious RTO timeout has occurred and avoids unnecessary retransmissions from the TCP sender. Supported parameter values are:
0: Local system never uses F-RTO.
1: Local system uses F-RTO detection and the F-RTO response algorithm.
2: Local system uses the F-RTO detection algorithm and a response algorithm based on Congestion Window Validation.
[0-2] Default: 0
Refer to section 4.3.2 for the F-RTO discussion in the wireless networks environment.
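As the example referred to above, the FIN_WAIT_2 workaround could be applied as follows. The 10-minute value is illustrative only and, as noted, the timer affects only connections that enter FIN_WAIT_2 after the command is run:

    ndd -set /dev/tcp tcp_fin_wait_2_timeout 600000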

tcp_ip_abort_cinterval: Second threshold timer during connection establishment. When it must retransmit the SYN packet because a timer has expired, TCP first compares the total time it has waited against two thresholds, as described in RFC 1122, 4.2.3.5. If it has waited longer than the second threshold, TCP terminates the connection. In other words, this is how long TCP will wait before failing a call to connect(). Setting this parameter too low will cause false connection failures. [1000,-] Default: 75000 (75 seconds)
This value is also used when we will be passively accept()ing the connection. In this case, the "SYN" in question is the SYN|ACK we sent in response to the remote's SYN segment. However, unlike the active connect case, the passively accepting application will not see the length of time spent in system calls change when this value is changed. This is the second of two timer thresholds for embryonic connections. The other is tcp_ip_notify_cinterval.


tcp_ip_abort_interval: Second threshold timer for established connections. When it must retransmit packets because a timer has expired, TCP first compares the total time it has waited against two thresholds, as described in RFC 1122, 4.2.3.5. If it has waited longer than the second threshold, TCP aborts the connection. For best results, do not set this parameter lower than tcp_time_wait_interval. In addition, you should not set this parameter to less than four minutes (240000 ms) so that the port number is not re-used prematurely. [500,-] Default: 600000 (10 minutes)
This setting will only terminate a connection which is actively trying to send data to the other end. It will have no effect on a connection which is idle and waiting to receive data. Aborting a TCP connection will bypass the TIME_WAIT state, which is an integral part of TCP's correctness algorithms. To be absolutely compliant with the spirit of the RFCs, the TIME_WAIT state should last for four minutes. Given that an aborted connection will not go through TIME_WAIT, it is best that it too sit for at least four minutes. This is the second of two timer thresholds for established connections. The first threshold is tcp_ip_notify_interval, the point at which TCP will tell IP that it thinks the current route to the destination might not be any good.

tcp_ip_notify_cinterval: First threshold timer during connection establishment. If the first threshold is exceeded, TCP notifies IP that it is having trouble with the current connection establishment and requests IP to delete the routing table entry for this destination. The assumption is that if no SYN-ACK has been received for an extended period of time, there may be network routing problems and IP should try to find a new route. [1000,-] Default: 10000 (10 seconds) This value is used both when applications actively establish connections via connect(), and when applications will be passively accept()'ing connections.

tcp_ip_notify_interval: First threshold timer for established connections. If the first threshold is exceeded, TCP notifies IP that it is having trouble with the current established connections and requests IP to delete the routing table entry for this destination. The assumption is that if no ACK has been received for an extended period of time, there may be network routing problems and IP should try to find a new route. [500,-] Default: 10000 (10 seconds)


tcp_ip_ttl: TTL value inserted into IP header for TCP packets only. [1, 255] Default: 64 A default value of 64 means that TCP will not communicate with any system that is more than 64 hops (routers) away. This should be sufficient for 99.999% of all cases. However, increasing this value to 255 would have no downside in a non-error case. It would make things a bit worse in the case of a routing loop (router A sends to router B which sends back to router A and the like) as it would take that many more trips through the loop to "kill-off" the IP datagram containing that TCP segment.

tcp_ip6_hop_limit: Hop-Limit value inserted into IPv6 header for TCP packets only. [1, 255] Default: 64

tcp_keepalive_detached_interval: Interval for sending keep-alive probes when TCP is detached. When a TCP instance exists without a corresponding stream or queue, that TCP instance is referred to as "detached". When a stream closes before the TCP FIN-ACK handshake is complete, TCP must retain information stored in its instance data to complete the handshake. If the ACK to terminate a connection in FIN_WAIT_2 state never arrives, then a detached TCP instance could remain allocated indefinitely with no way to delete it. To avoid this potential problem, a keep-alive timer is started when a TCP stream closes before the TCP state is CLOSED. The time-out interval is tcp_keepalive_detached_interval. If this timer expires, the keep-alive algorithm will ensure that the TCP instance will eventually be deleted. [10000,10*24*3600000] Default: 120000 (2 minutes)
Here "without a stream or queue" (i.e. detached) in essence means "no socket." That is to say, the application has close()'d the socket. This value determines when keepalive probes will start for a detached connection in FIN_WAIT_2. So long as the remote (or something taking responsibility for the remote, for instance a firewall) responds to the keepalive probes, the connection will remain; after all, if the firewall is still responding when the remote is really gone, the firewall is broken and must be fixed. Once keepalive probes start, they will be retransmitted, and the connection possibly aborted, based on the regular TCP retransmission timer settings. tcp_keepalive_detached_interval applies only to detached connections in the FIN_WAIT_2 or LAST_ACK state. The classic TCP keepalive mechanism for connections in other active states, enabled by a call to setsockopt(), is controlled via tcp_keepalive_interval.


tcp_keepalive_interval: Interval for sending keep-alive probes. If any activity has occurred on the connection or if there is any unacknowledged data when the time-out period expires, the timer is simply restarted. If the remote system has crashed and rebooted, it will presumably know nothing about this connection, and it will issue an RST in response to the keepalive ACK. Receipt of the RST will terminate the connection. If the keepalive packet is not ACK'd by the remote TCP, the normal retransmission time-out will eventually exceed threshold R2, and the connection will be terminated. With this keepalive behavior, a connection can time-out and terminate without actually receiving an RST from the remote TCP. [10000, 10*24*3600000] Default: 2 * 3600000 (2 hours)
These keepalives for an established connection will only be enabled if the application uses a setsockopt()/t_optmgmt() call to enable keepalives (SO_KEEPALIVE). They are not enabled otherwise. This differs from the behavior of tcp_keepalive_detached_interval. A large number of idle connections, with keepalives enabled, could generate more keepalive traffic than real traffic. They could also keep on-demand/dial-up links open unnecessarily.
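For instance, to make keepalive probing start after 15 minutes of idle time rather than two hours (an illustrative value; this only affects connections on which the application has itself enabled SO_KEEPALIVE):

    ndd -set /dev/tcp tcp_keepalive_interval 900000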

tcp_largest_anon_port: Largest anonymous port number to use. [1024, 65535] Default: 65535
It is unlikely that this will ever need to be changed. The only reason would be to provide some similarity to anonymous (aka ephemeral) port number assignments of older BSDish stacks. There are no checks made to ensure that tcp_largest_anon_port is set to a sane value with respect to the smallest anonymous port number. Setting tcp_largest_anon_port to a value smaller than tcp_smallest_anon_port will lead to undefined behavior.

tcp_recv_hiwater_def: The maximum size for the receive window. [4096,-] Default: 32768 bytes
This can also be thought of as the default receive socket buffer (aka SO_RCVBUF) size. It is used for any connection using an interface that is NOT marked as either a "long, fat pipe" (LFP) or a "long, narrow pipe" (LNP). One of the "limits" to the performance of a TCP connection is based on the relationship between the TCP window (W) and the round-trip time (RTT) of the connection. Basically, a sending TCP cannot send more than one window's worth of data before it must stop and wait for a "window update" from the remote. The soonest a window update can arrive is one RTT. From this we have the function: Throughput <= W/RTT


A window size of 32768 bytes is enough to allow 10 Mbit/s of throughput out to an RTT of roughly 25 milliseconds. It would allow 100 Mbit/s of throughput out to roughly 2.5 milliseconds, and 1000 Mbit/s of throughput out to roughly 0.25 milliseconds. Typically, the round-trip time on a local LAN is <= 1 millisecond. For a terrestrial (no satellites) link across the continental US the RTT is anywhere between 30 and 100 milliseconds, though it can be higher. If there is a satellite hop involved, the RTT will be a minimum of roughly 250 milliseconds for each satellite hop, and if a satellite hop is used in both directions, that means 500 milliseconds. For a WAN the RTT is (generally) more a function of distance and the speed of light than it is a function of bit-rate - the RTT for a satellite 1.5 Mbit/s link and a 45 Mbit/s link would be about the same.

tcp_recv_hiwater_lfp: The maximum size for the receive window for "long, fat pipe" interfaces such as Fibre Channel which provide high bandwidth with high latency. [4096,-] Default: 65536 bytes

tcp_recv_hiwater_lnp: The maximum size for the receive window for "long, narrow pipe" interfaces such as PPP over a 56 kb modem which provide low bandwidth with high latency. [4096,-] Default: 8192 bytes
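To size these receive-window defaults for a particular path, the bandwidth-delay product gives a useful first estimate. For example, a 1 Gbit/s path with a 50 ms RTT can hold roughly 125,000,000 bytes/s * 0.05 s = 6,250,000 bytes in flight, so a 32768- or 65536-byte window would severely limit throughput; a window on the order of 6 MB (and therefore the RFC 1323 window scale option, since this exceeds 65535 bytes) would be needed to keep such a path full.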

tcp_rexmit_interval_initial: Initial value for round trip time-out, from which the retransmit time-out is computed. [1,20000] Default: 3000ms When TCP first sends a segment to the remote, it has no history about the round-trip times. So, it will use tcp_rexmit_interval_initial as a best guess for its first retransmit timeout setting. Setting this value too small will result in spurious retransmissions and could result in poor initial connection performance. If this value is set too large, TCP may not be able to send many retransmissions before it reaches the tcp_ip_abort_interval or tcp_ip_abort_cinterval and could result in a spurious connection abort.

tcp_rexmit_interval_initial_lnp: Same as tcp_rexmit_interval_initial, but used for devices with the LNP (Long Narrow Pipe) flag set. [1,20000] Default: 3000ms

tcp_rexmit_interval_max: Upper limit for computed round trip time-out. [1,7200000] Default: 60000 (1 minute)
It should almost never be the case that this is in need of changing. If you make the value larger, there will be fewer total retransmissions. Unless you also increase tcp_ip_abort_interval, this could lead to an increase in "false positives" on connection failures. Setting this value too low could also contribute to what is called "congestive collapse" of the network, where the network is ~100% occupied with retransmissions instead of new data. It would do this by preventing TCP from being able to back off its retransmission timer far enough to be at or above the actual round-trip time of the network.

tcp_rexmit_interval_min: Lower limit for computed round trip time-out. Unless you know that all TCP connections from the system are going through links where the RTT is greater than 500 milliseconds, and are also _highly_ variable in their RTTs you should not increase this value. If you increase this value, and there are actual packet losses, you could have a drop in throughput as TCP could sit longer waiting for a TCP retransmission timer to expire. Similarly, unless you have concrete proof that the RTT is never more than 500 milliseconds, you should not decrease this value as it could result in spurious TCP retransmissions which could also decrease throughput by keeping the TCP congestion window artificially small. [1,7200000] Default: 500ms

tcp_sack_enable: Allows recipients to selectively acknowledge out-of-sequence data. The TCP sender can then retransmit only the lost segments and adjust its send window to reflect the actual amount of received data. The use of TCP SACK is controlled by the ndd parameter tcp_sack_enable. Supported parameter values are:
0: Local system never uses SACK.
1: Local system sends the SACK option in SYN packet.
2: Local system enables SACK if remote system negotiates the use of SACK in SYN packet (default).
Negotiation of the use of SACK is done by sending TCP SYN packets with an Option Kind value of 4 to indicate that the system can receive (and process) SACKs. TCP packets with SACK information will have an Option Kind value of 5. [0-2] Default: 2

SACK or "Selective ACKnowledgment" is a TCP feature intended to increase performance for data transfers over lossy links. It accomplishes this by extending TCP's original, simple "ACK to the first hole in the data" algorithm with one that can describe which segments past the first lost segment are missing. This information, sent from the receiver to the sender, allows the sender to both retransmit lost segments sooner, and avoid retransmission of segments which were not lost (those segments received, coming after the first hole in the sequence space). The default value is a somewhat conservative value of two (2) - the system will not initiate the use of SACK. However, if the remote initiates the connection to us, and asks for SACK, we will honour that request. A value of one (1) should be used if you want the system to use SACK for those connections initiated from the system itself (i.e. applications on the system calling connect() or t_connect()).


It is very unlikely that a value of zero (0) would ever be indicated. One unlikely case would be when one knows that severely bandwidth constrained links are in use and the additional bytes of the SACK option would limit effective bandwidth.

tcp_sth_rcv_hiwat: If nonzero, sets the Stream-head flow control high water mark. [0,128000] Default: 0
The stream head flow control high water mark is set to the larger of tcp_sth_rcv_hiwat or the receive window of the connection. The default value of 0 means that the high water mark will be set to the receive window of the connection.

tcp_sth_rcv_lowat: If nonzero, sets the Stream-head flow control low water mark. [0,128000] Default: 0

tcp_syn_rcvd_max: Controls the SYN attack defense of TCP. The value specifies the maximum number of suspect connections that will be allowed to persist in SYN_RCVD state. For SYN attack defense to work, this number must be large enough so that a legitimate connection will not age out of the list before an ACK is received from the remote host. This number is a function of the speed at which bogus SYNs are being received and the maximum round trip time for a valid remote host. This is very difficult to estimate dynamically, but the default value of 500 has proven to be highly effective. [1,10000] Default: 500 connections

tcp_status: Obtains a complete report, similar to "netstat -an", on all TCP instances. Requests for this report through concurrent execution of ndd instances are serialized through a semaphore. Hence a tcp_status report invocation through ndd may appear to hang in case there is already an ndd instance generating a tcp_status/udp_status report running on the system.
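For example, the report can be generated directly from the command line (the output format is release dependent):

    ndd -get /dev/tcp tcp_status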

tcp_time_wait_interval: Amount of time TCP endpoints persist in TCPS_TIME_WAIT state. [1000,600000] Default: 60000 (60 seconds) The TIME_WAIT interval is an integral part of TCP's correctness algorithms. TCP connections are "named" (uniquely identified) by the four-tuple of local and remote IP address, and local and remote TCP port number. There is no concept of "this is the N'th connection by this name." So to prevent TCP segments from an old connection being accepted on a new connection, TCP uses the TIME_WAIT state. This preserves TCP information long enough to be statistically certain that all the segments of the old TCP connection by that name are gone.


The HP-UX TCP stack can track literally millions of TIME_WAIT connections with no particular decrease in performance and only a slight cost in terms of memory. So, it should almost never be the case that you need to decrease this value from its default of 60 seconds.

tcp_ts_enable: RFC 1323 defines a timestamps option that can be sent with every segment. The timestamps in the option are used for two purposes:
- More accurate RTTM (Round Trip Time Measurement), that is, the interval between the time a TCP segment is sent and the time the acknowledgement returns.
- PAWS (Protect Against Wrapped Sequences) on very high-speed networks. On connections with large transmission rates where the sequence number may wrap, the timestamps are used to detect old packets.
Supported parameter values are:
0: Never timestamp
1: Always initiate
2: Allow but don't initiate (Default)
Use of timestamps is requested by the initiator of a TCP connection by sending a timestamps option (Option Kind 8) in the initial TCP SYN packet. [0-2] Default: 2

Timestamps are part of TCP's optional support for large windows, that is, windows larger than 65535 bytes. With larger windows, but the same size TCP sequence number space, it becomes possible to "wrap" the sequence number before an old segment with that same sequence number is statistically known to have left the network. So, timestamps essentially increase the effective sequence number space: two sends with the same sequence number will have a different timestamp. These timestamps are echoed back by the receiver in the ACKs it sends back to the sender. This information can be used by the sender to get a more accurate picture of the round-trip time between the two ends of the connection. This can result in more accurate retransmission timeouts and fewer spurious ones. The default value is two (2) - do not ask for, but accept, timestamp options. Basically, if the remote initiates a connection to the local system and asks for timestamps, they will be used. Otherwise, for connections initiated by the local system, timestamps will not be requested. Again, this is one of those "conservative in what you send" defaults. A value of one (1) means that the system will ask for timestamps on connections it initiates, and will accept the use of timestamps on connections initiated by remote systems. A value of zero (0) means that the system will never ask for timestamps on connections it initiates, nor will it accept the use of timestamps on connections initiated by remote systems. This value would likely only be used when the added option bytes were consuming too much bandwidth. Timestamps should always be used if one is going to use windows larger than 65535 bytes. So, if a system is configured with a tcp_xmit_hiwater_* or tcp_recv_hiwater_* larger than 65535 bytes, tcp_ts_enable should be set to one (1). To be safe, timestamps should also be used anytime a single connection will be able to send data faster than one GB per minute. The rationale here is that we do not want a TCP connection wrapping its 4 GB sequence number space within 2MSL.


The RFC-suggested value for 2MSL is four minutes, hence timestamps should be used whenever a TCP connection is going to run at sustained rates of more than 1 GB per minute, which is ~17 MB/s or ~145 Mbit/s. Admittedly, this is pretty conservative.

tcp_tw_cleanup_interval: Interval in milliseconds between checks to see if TCP connections in TIME_WAIT have reached or exceeded the tcp_time_wait_interval. Reducing the length of this interval increases the precision of the timeout. If tcp_time_wait_interval is set to 60 seconds (the default) and tcp_tw_cleanup_interval is also 60 seconds, a TCP connection may spend as little as 60 seconds, or as much as two minutes, in TIME_WAIT. [10000-300000] Default: 60000

tcp_xmit_hiwater_def: The amount of unsent data that triggers write-side flow control. [4096,-] Default: 32768 bytes
This can be thought of as the default send socket buffer size (SO_SNDBUF). Once this many bytes of data are queued to the connection (actually, it tends to be more than this value), any subsequent send() (etc.) call will either block or return EWOULDBLOCK/EAGAIN if the socket was marked non-blocking. Increasing this value allows an application to put more data into the connection at one time. This can be useful to minimize the number of send()/write() system calls required to get the data into the transport. Also, it allows the transport a better chance at taking full advantage of the remote TCP's advertised window, which might otherwise be limited by flow control on the sending application. See the discussion of tcp_recv_hiwater_def for a discussion of the limit to throughput based on window size and RTT - the same idea holds. Having this greater quantity of data in flight allows greater throughput in the face of occasional packet loss. There is a slight chance that too large a value will put more data out onto the network than the network can hold and actually cause packet losses.

tcp_xmit_hiwater_lfp: The amount of unsent data that triggers write-side flow control for fast links. [4096,-] Default: 65536 bytes

tcp_xmit_hiwater_lnp: The amount of unsent data that triggers write-side flow control for slow links.[4096,-] Default: 8192 bytes

tcp_xmit_hiwater_max: Limits the send buffer size for TCP sockets or communication endpoints specified in a SO_SNDBUF option of a setsockopt() call or XTI_SNDBUF option in a t_optmgmt() call.


A setsockopt() call with a SO_SNDBUF option that exceeds the corresponding kernel parameter value will fail and return the errno value EINVAL. A t_optmgmt() call with an XTI_SNDBUF option that exceeds the corresponding kernel parameter value will fail and return the t_errno value TBADOPT. [1024-2147483647] Default: 2147483647 bytes

This tunable can be used to limit the maximum value for tcp_xmit_hiwater_* and for the value passed in via setsockopt() for SO_SNDBUF (similarly XTI_SNDBUF for t_optmgmt()). In so doing, the administrator can make sure that no one application can monopolize the memory of the system by asking for a very large socket buffer and then filling it. This tunable can also be seen as a way to limit the bandwidth consumed by a TCP application, based on the TCP throughput limit of window-size / round-trip-time.

tcp_xmit_lowater_def: The amount of unsent data that relieves write-side flow control. [2048,-] Default: 8192 bytes

tcp_xmit_lowater_lfp: The amount of unsent data that relieves write-side flow control for fast links.[2048,-] Default: 16384 bytes

tcp_xmit_lowater_lnp: The amount of unsent data that relieves write-side flow control for slow links.[4096,-] Default: 2048 bytes
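Putting the send- and receive-side buffer tunables together, a high bandwidth-delay path might be tuned roughly as follows. The 256 KB values are illustrative only; note that because they exceed 65535 bytes, tcp_ts_enable is also set to 1, as recommended in the tcp_ts_enable discussion. The runtime ndd -set commands do not survive a reboot, so matching entries would normally be added to /etc/rc.config.d/nddconf (the index numbers shown are examples):

    ndd -set /dev/tcp tcp_recv_hiwater_def 262144
    ndd -set /dev/tcp tcp_xmit_hiwater_def 262144
    ndd -set /dev/tcp tcp_ts_enable 1

    # /etc/rc.config.d/nddconf
    TRANSPORT_NAME[0]=tcp
    NDD_NAME[0]=tcp_recv_hiwater_def
    NDD_VALUE[0]=262144
    TRANSPORT_NAME[1]=tcp
    NDD_NAME[1]=tcp_xmit_hiwater_def
    NDD_VALUE[1]=262144
    TRANSPORT_NAME[2]=tcp
    NDD_NAME[2]=tcp_ts_enable
    NDD_VALUE[2]=1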

UDP Tunables
udp_def_ttl: Default Time-to-Live inserted into IP header. [1,255] Default: 64 This behaves just as tcp_ip_ttl, only for the IPv4 datagrams containing UDP datagrams.

udp_def_hop_limit: Default Hop-Limit inserted into IPv6 header. [1,255] Default: 64 This behaves just as tcp_ip6_hop_limit, only for the IPv6 datagrams containing UDP datagrams.


udp_largest_anon_port: Largest port number to use for anonymous bind requests. [1024,65535] Default: 65535
This is analogous to the tcp_largest_anon_port tunable.

udp_status: Obtains a UDP information report similar to "netstat -an". Requests for this report through concurrent execution of ndd instances are serialized through a semaphore. Hence a udp_status report invocation through ndd may appear to hang in case there is already an ndd instance generating a tcp_status/udp_status report running on the system.

udp_recv_hiwater_max: Limits the receive buffer size for TCP and UDP sockets or communication endpoints specified in a SO_RCVBUF option of a setsockopt() call or XTI_RCVBUF option in a t_optmgmt() call. A setsockopt() call with a SO_RCVBUF option that exceeds the corresponding kernel parameter value will fail and return the errno value EINVAL. A t_optmgmt() call with an XTI_RCVBUF option that exceeds the corresponding kernel parameter value will fail and return the t_errno value TBADOPT. A socket() call to create a UDP socket will fail and return the errno value EINVAL if the value of the ndd parameter socket_udp_rcvbuf_default exceeds the value of udp_recv_hiwater_max. [1024-2147483647] Default: 2147483647 bytes

This tunable can be used to limit the maximum value for socket_udp_rcvbuf_default and for the value passed in via setsockopt() for SO_RCVBUF (similarly XTI_RCVBUF for t_optmgmt()). In doing so, the administrator can make sure that no one application can monopolize the memory of the system by asking for a very large socket buffer and then filling it. The default and maximum value for this parameter is 2147483647 (2^31 - 1) bytes.
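As a sketch of how these UDP limits fit together (assuming, as elsewhere in this appendix, that the udp_* parameters live under /dev/udp and the socket_* parameters under /dev/sockets; the 1 MB value is illustrative):

    ndd -get /dev/udp udp_recv_hiwater_max
    ndd -set /dev/sockets socket_udp_rcvbuf_default 1048576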


Table 1: Summary of TCP/IP Tunables


The following TCP/IP tunables may be queried or set using ndd(1M). All tunables are global, i.e., they affect all TCP/IP connections. Note that some tunables take effect immediately, while others, which are used to initialize TCP/IP connections, will only affect newly opened connections. See Appendix B for the detailed description of these tunables.

Tunable Name TCP: tcp_conn_request_max

Description Max number of outstanding connection requests

Reference 4.1.1.2 4.2.4 Appendix A 2.4

tcp_cwnd_initial tcp_deferred_ack_max

Initial size of the congestion window as a multiple of the MSS Upper limit on the number of bytes of data (measured in MSS) that can be received without an ACK. Delay final ack for 3-way handshake connection establishment Controls when a T_CONN_IND message is sent upstream Length of time a TCP connection spends in FIN_WAIT_2 Enable Forward RTO Recovery R2 during connection establishment R2 for established connection R1 during connection establishment R1 for established connection TTL value inserted into IP header Send keepalive probes for detached TCP Interval for sending keepalive probes Largest anonymous port number to use Default receive window size Upper bound on TCP receive buffer size Default receive window size for fast links Default receive window size for slow links Initial value for round trip time-out tcp_rexmit_interval_initial for LNP

tcp_delay_final_twh_ack tcp_early_conn_ind tcp_fin_wait_2_timeout tcp_frto_enable tcp_ip_abort_cinterval tcp_ip_abort_interval tcp_ip_notify_cinterval tcp_ip_notify_interval tcp_ip_ttl tcp_keepalive_detached_interval tcp_keepalive_interval tcp_largest_anon_port tcp_recv_hiwater_def tcp_recv_hiwater_max tcp_recv_hiwater_lfp tcp_recv_hiwater_lnp tcp_rexmit_interval_initial tcp_rexmit_interval_initial_lnp

4.3.2 4.3.1 4.1.1.4 Appendix A 4.3.1

Appendix A Appendix A 4.1.1.5 Appendix A

2.1, 5.3 5.3 5.3 5.3 4.3.1 Appendix A


Tunable Name tcp_rexmit_interval_max tcp_rexmit_interval_min tcp_sack_enable tcp_smoothed_rtt tcp_sth_rcv_hiwat tcp_sth_rcv_lowat tcp_syn_rcvd_max tcp_status tcp_time_wait_interval tcp_ts_enable tcp_tw_cleanup_interval tcp_xmit_hiwater_def

Description Upper limit for computed round trip timeout Lower limit for computed round trip timeout Enable TCP Selective Acknowledgement (RFC 2018) An alternate method for computing round trip time Sets the flow control high water mark Sets the flow control low water mark Controls the SYN attack defense of TCP Get netstat-like TCP instances information How long a connection persists in TIME_WAIT Enable TCP timestamp option TIME_WAIT timeout expiration checking interval The amount of unsent data that triggers TCP flow control

Reference Appendix A 4.3.1 Appendix A 2.2 Appendix A

Appendix A

4.1.1.1

2.1 4.1.1.3 4.1.2.1 5.3 5.3 5.3 5.3 5.3 5.3 5.3

tcp_xmit_hiwater_lfp tcp_xmit_hiwater_lnp tcp_xmit_hiwater_max tcp_xmit_lowater_def tcp_xmit_lowater_lfp tcp_xmit_lowater_lnp

The amount of unsent data that triggers TCP flow control for fast links The amount of unsent data that triggers TCP flow control for slow links Upper bound on TCP send buffer The amount of unsent data that relieves TCP flow control The amount of unsent data that relieves TCP flow control for fast links The amount of unsent data that relieves TCP flow control for slow links

IPv4: ip_def_ttl ip_forward_directed_broadcasts ip_forward_src_routed ip_forwarding ip_fragment_timeout ip_ill_status ip_ipif_status ip_ire_hash Controls the default TTL in the IP header Controls subnet broadcasts packets Controls forwarding of source routed packets Controls how IP hosts forward packets Controls how long IP fragments are kept Displays a report of all physical interfaces Displays a report of all logical interfaces Displays all routing table entries, in the order Appendix A


Tunable Name ip_ire_status ip_ire_cleanup_interval ip_ire_flush_interval ip_ire_gw_probe ip_ire_gw_probe_interval ip_ire_pathmtu_interval ip_pmtu_strategy ip_reass_mem_limit ip_send_redirects ip_send_source_quench ip_strong_es_model

Description searched when resolving an address Displays all routing table entries Timeout interval for purging routing entries Routing entries deleted after this interval Enable dead gateway probes Probe interval for Dead Gateway Detection Controls the probe interval for PMTU Controls the Path MTU Discovery strategy Maximum number of bytes for IP reassembly Sends ICMP 'Redirect' packets Sends ICMP 'Source Quench' packets Controls multihoming

Reference

Appendix A 6.1.2 Appendix A

IPv6: ip6_def_hop_limit ip6_fragment_timeout ip6_icmp_interval ip6_ill_status ip6_ipif_status ip6_ire_cleanup_interval ip6_ire_hash ip6_ire_pathmtu_interval ip6_ire_redirect_interval ip6_ire_status ip6_reass_mem_limit ip6_send_redirect ip6_tcp_status ip6_udp_status ip6_ire_reachable_interval ip6_max_random_factor ip6_min_random_factor Controls the default Hop Limit in the IPv6 packets Controls how long IPv6 fragments are kept Limits the sending rate of ICMPv6 error messages Displays a report of all IPv6 physical interfaces Displays a report of all IPv6 logical interfaces Timeout interval for purging IPv6 routing entries Displays all IPv6 routing table entries Controls the probe interval for IPv6 PMTU Controls IPv6 'Redirect' routing table entries Displays all IPv6 routing table entries Maximum number of bytes for IPv6 reassembly Sends ICMPv6 'Redirect' packets Reports IPv6 level TCP fanout table Reports IPv6 level UDP fanout table Controls the ND REACHABLE_TIME Controls the ND MAX_RANDOM_FACTOR Controls the ND MIN_RANDOM_FACTOR 6.1.2 Appendix A


Tunable Name ip6_nd_advertise_count ip6_nd_dad_solicit_count ip6_nd_multicast_solicit_count ip6_nd_probe_delay ip6_nd_transmit_interval ip6_nd_unicast_solicit_count ip6_rd_solicit_count ip6_rd_solicit_delay ip6_rd_transmit_interval

Description Controls the ND MAX_NEIGHBOR_ADVERTISEMENT Controls the number of duplicate address detection Controls the ND MAX_MULTICAST_SOLICIT Controls the ND DELAY_FIRST_PROBE_TIME Controls the ND RETRANS_TIMER Controls the ND MAX_UNICAST_SOLICIT Controls the ND MAX_RTR_SOLICITATIONS Controls the ND MAX_RTR_SOLICITATIONS_DELAY Controls the ND RTR_SOLICITATION_INTERVAL

Reference

Sockets: socket_buf_max socket_caching_tcp socket_enable_tops socket_msgeof socket_qlimit_max socket_udp_rcvbuf_default Sets maximum socket buffer size for AF_UNIX sockets Controls socket caching for TCP sockets Controls the TOPS optimization feature Enables the MSG_EOF feature Sets maximum number of connection requests for non-AF_UNIX sockets Sets the default receive buffer size for UDP sockets Set the default send buffer size for UDP sockets 4.1.2.2 6.1.2 Appendix A 4.1.2.3 4.2.2 3.1.2 4.2.5

socket_udp_sndbuf_default


Table 2: Operating System Support for TCP/IP Tunables


Table 2 indicates which versions of the operating system support the TCP/IP tunables described in this document. An * (asterisk) specifies that the version supports the tunable and does not require any patch. A - (dash) specifies that the version does not support the tunable.

Tunable Name TCP: tcp_conn_request_max tcp_cwnd_initial tcp_deferred_ack_max tcp_delay_final_twh_ack

11i v1

11i v2

11i v3

* * * Patch Level PHNE_35351 or higher Patch Level PHNE_28089 or higher * Patch Level PHNE_35351 or higher * * * * * * * * * * * * * * *

* * * Patch Level PHNE_35765 or higher *

* * * *

tcp_early_conn_ind

tcp_fin_wait_2_timeout tcp_frto_enable

* Patch Level PHNE_35765 or higher * * * * * * * * * * * * * * *

* *

tcp_ip_abort_cinterval tcp_ip_abort_interval tcp_ip_notify_cinterval tcp_ip_notify_interval tcp_ip_ttl tcp_keepalive_detached_interval tcp_keepalive_interval tcp_largest_anon_port tcp_recv_hiwater_def tcp_recv_hiwater_max tcp_recv_hiwater_lfp tcp_recv_hiwater_lnp tcp_rexmit_interval_initial tcp_rexmit_interval_initial_lnp tcp_rexmit_interval_max

* * * * * * * * * * * * * * *


Tunable Name tcp_rexmit_interval_min tcp_sack_enable tcp_smoothed_rtt tcp_sth_rcv_hiwat tcp_sth_rcv_lowat tcp_syn_rcvd_max tcp_status tcp_time_wait_interval tcp_ts_enable tcp_tw_cleanup_interval tcp_xmit_hiwater_def tcp_xmit_hiwater_lfp tcp_xmit_hiwater_lnp tcp_xmit_hiwater_max tcp_xmit_lowater_def tcp_xmit_lowater_lfp tcp_xmit_lowater_lnp

11i v1 * * * * * * * * * * * * * * * * *

11i v2 * * * * * * * * * * * * * * * * *

11i v3 * * * * * * * * * * * * * * * * *

IPv4: ip_def_ttl ip_forward_directed_broadcasts ip_forward_src_routed ip_forwarding ip_fragment_timeout ip_ill_status ip_ipif_status ip_ire_hash ip_ire_status ip_ire_cleanup_interval ip_ire_flush_interval ip_ire_gw_probe ip_ire_gw_probe_interval ip_ire_pathmtu_interval * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Tunable Name ip_pmtu_strategy ip_reass_mem_limit ip_send_redirects ip_send_source_quench ip_strong_es_model

11i v1 * * * * *

11i v2 * * * * *

11i v3 * * * * *

IPv6: ip6_def_hop_limit ip6_fragment_timeout ip6_icmp_interval ip6_ill_status ip6_ipif_status ip6_ire_cleanup_interval ip6_ire_hash ip6_ire_pathmtu_interval ip6_ire_redirect_interval ip6_ire_status ip6_reass_mem_limit ip6_send_redirect ip6_tcp_status ip6_udp_status ip6_ire_reachable_interval ip6_max_random_factor ip6_min_random_factor ip6_nd_advertise_count ip6_nd_dad_solicit_count ip6_nd_multicast_solicit_count ip6_nd_probe_delay ip6_nd_transmit_interval ip6_nd_unicast_solicit_count ip6_rd_solicit_count ip6_rd_solicit_delay ip6_rd_transmit_interval IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot IPv6NCF11i depot * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Tunable Name Sockets: socket_buf_max socket_caching_tcp socket_enable_tops

11i v1 * * Patch Level PHNE_33159 or higher

11i v2 * * Patch Level PHNE_33798 or higher

11i v3 * * Patch Level PHNE_36281 or higher Patch Level PHNE_36281 or higher * * *

socket_msgeof socket_qlimit_max socket_udp_rcvbuf_default socket_udp_sndbuf_default

* * *

* * *


Revision History
Periodically, this document is updated as new information becomes available. This document's revision history is as follows:
Version 1.0: August 2007
Version 1.1: March 2008

© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
