How to Diagnose TCP Connection Setup Issues?


This is the first article in a series of articles covering all you need to know to troubleshoot performance
issues impacting applications relying on the TCP Protocol. (See the bottom of this article for a full list)
In this article, we will consider the TCP connection setup.

Let’s have a look at how TCP sessions are established… and what can go wrong!

The TCP protocol is a connection-oriented protocol, which means that a connection is established and
maintained until the application programs at each end have finished exchanging messages. TCP works
with the Internet Protocol (IP).

TCP provides reliable, ordered, and error-checked transmission. To do so, TCP relies on a handshake
and on control flags such as SYN, ACK, FIN, RST, and PSH to manage the connection and to avoid losing
any information.

TCP is used under a number of application protocols, such as HTTP, so it is important to know how to
diagnose TCP issues. In this series of articles, we will explain TCP meta information, why it is
important for performance troubleshooting, and how to measure it easily with SkyLIGHT PVX.

How Does a Session Start? TCP Handshake & Connection Time

A TCP connection setup, also called the 3-way handshake, is achieved with SYN, SYN+ACK and ACK packets.
From this handshake, we can extract a performance metric called Connection Time (CT), which
summarizes how fast a session can be set up between a client and a server over a network. For more
details, see this excellent article on Wikipedia.

Figure 1 – How TCP handshake is analyzed

The three steps of the TCP handshake are:

1. The ‘SYN’ is the first packet, sent from the client to the server; it asks the server to open a
connection.
2. If it can accept the connection, the server responds with a ‘SYN+ACK’, which means “I received
your ‘SYN’ and I’m OK”.
3. Finally, the client sends an ‘ACK’ to confirm the connection.
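
The three steps above can be turned into a small calculation. The following is a minimal sketch, assuming that Connection Time is taken as the delay between the first SYN and the final ACK as seen at the capture point; the exact definition used by SkyLIGHT PVX may differ, and the `Handshake` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Handshake:
    syn: float      # capture time of the client's SYN, in seconds
    syn_ack: float  # capture time of the server's SYN+ACK
    ack: float      # capture time of the client's final ACK


def connection_time(h: Handshake) -> float:
    """Total handshake duration at the capture point (assumed definition)."""
    if not (h.syn <= h.syn_ack <= h.ack):
        raise ValueError("handshake packets are out of order")
    return h.ack - h.syn


def split_delays(h: Handshake) -> tuple[float, float]:
    """Return (delay on the server side, delay on the client side) of the probe."""
    return h.syn_ack - h.syn, h.ack - h.syn_ack


if __name__ == "__main__":
    h = Handshake(syn=0.0, syn_ack=0.25, ack=0.5)
    print(connection_time(h), split_delays(h))
```

Splitting the total into the two halves hints at where the time is spent: the first half covers the path to the server plus the server's reaction, the second half covers the path back to the client.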

How to Diagnose TCP Connection Faults


1 – SYN Without Connections

A first question you can easily answer with SkyLIGHT PVX is: “Can my clients connect to my
servers?” In the PVX navigation menu, go to Application → Clients, then choose the TCP theme and set
the filter called “Only Unilateral Flow”. The pattern to look for is traffic from the client to the
server with no response from the server.

Figure 2 – Filter on Unilateral Flows Only

This view lists the top client IPs whose flows contain traffic from the client only, without any
response.

For Advanced Users of SkyLIGHT PVX

We set the filters to see unilateral flows, which shows mostly ‘SYN’ issues. However, you could also
get other types of flows. To query only the ‘SYN’ without connections, use a custom
filter:

Figure 3 – SkyLIGHT PVX finds unilateral flows and sorts them.

As you can see in the results above, several IPs try to connect to a server (SYN > 0) but
never manage to establish a connection (Connections = 0).

Here are common failure cases:

• A firewall denies those connections. In this case, you could apply the same query to client zones
(in the same menu) to see if the IPs are in the same zone.
• The server does not exist anymore or is not available. This happens frequently when a server IP
is changed, yet some clients continue to query the old one.
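
The “SYN without connections” pattern described in this section can be sketched outside the product as well. The example below is an illustration only: the flow records are hypothetical dictionaries with `client`, `syn` and `connections` keys, not a SkyLIGHT PVX export format.

```python
from collections import defaultdict


def clients_with_unanswered_syn(flows):
    """Return (client, syn_total) pairs with SYN > 0 and Connections = 0."""
    totals = defaultdict(lambda: [0, 0])  # client -> [syn, connections]
    for flow in flows:
        totals[flow["client"]][0] += flow["syn"]
        totals[flow["client"]][1] += flow["connections"]
    failing = [(c, syn) for c, (syn, conn) in totals.items() if syn > 0 and conn == 0]
    # Sort the worst offenders first, as in a "top clients" table.
    return sorted(failing, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"client": "10.0.0.1", "syn": 3, "connections": 0},
        {"client": "10.0.0.2", "syn": 2, "connections": 2},
        {"client": "10.0.0.1", "syn": 2, "connections": 0},
    ]
    print(clients_with_unanswered_syn(sample))
```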

2 – Bad Connection Ratio

In a perfect world, you should have 1 ‘SYN’ per TCP connection. SkyLIGHT PVX provides a metric to
measure this connection efficiency: the ‘SYN’ per Connection rate (the number of SYN packets
compared to the number of TCP sessions set up). This metric is available in the ‘details’
tables by using the TCP theme. You can also graph its evolution over time in Application → Custom
charts.

Figure 4 – PVX custom chart SYN/Conn

A bad ‘SYN’ efficiency is sometimes a network issue: the failed connection attempts are caused by
packet loss or congestion. You can check this assumption by looking at the Connection Time. If it
remains low and the problem impacts several hosts, then it’s probably a network issue.

However, if the Connection Time is high, the issue is more likely on the server side: the server is
overloaded and cannot answer all clients. Finally, if the ‘SYN’ ratio is huge, you may have a
security issue, such as a DDoS attack.
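
The reasoning in this section can be summarized as a small decision helper. This is a sketch only: the thresholds (`ratio_ok`, `ratio_huge`, `ct_high_ms`) are hypothetical values chosen for illustration and should be tuned to your own baseline, and the function names are not part of SkyLIGHT PVX.

```python
def syn_per_connection(syn_count, connections):
    """SYN packets per established connection, or None when nothing connected."""
    if connections == 0:
        return None  # this is the "SYN without connections" case instead
    return syn_count / connections


def likely_cause(ratio, connection_time_ms, *,
                 ratio_ok=1.2, ratio_huge=50.0, ct_high_ms=200.0):
    """Map a SYN ratio and a Connection Time to the article's rule of thumb."""
    if ratio <= ratio_ok:
        return "healthy"
    if ratio >= ratio_huge:
        return "possible attack"
    if connection_time_ms >= ct_high_ms:
        return "server overload"
    # Low Connection Time with repeated SYNs points to the network,
    # especially when several hosts are affected.
    return "network loss or congestion"


if __name__ == "__main__":
    print(likely_cause(syn_per_connection(30, 10), connection_time_ms=20.0))
```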

Advanced SkyLIGHT PVX

The network latency – RTT (Round Trip Time) – can give you another indication that the issue is on the
network side. SkyLIGHT PVX provides the RTT in the Network Performance metric theme.

Figure 5 – Troubleshoot connections with Connection Times and SYN rates

Conclusion
In this first article, we gave a short presentation of TCP performance metrics and of how the TCP
protocol sets up connections with SYN / SYN+ACK / ACK packets. We also saw some common failure cases
that can be diagnosed easily with SkyLIGHT PVX.

To troubleshoot these kinds of issues, we used the Top Clients, Top Client Zones and Custom Charts
pages. To go further, we used “Advanced Filter: Unilateral Flows” to filter flows with no responses.

We introduced several metrics: the number of ‘SYN’ and ‘Handshakes’ (connections), the SYN Efficiency
and the Connection Time.

Things that can go wrong when you close TCP sessions


This is the second in a series of articles covering everything that you need to know to troubleshoot
performance issues impacting applications that rely on the TCP protocol. After studying how TCP
sessions are established in our first article, we will now see what can go wrong when you close TCP
sessions.

Paradoxically, it is more complex to close a TCP connection than it is to create one! This is because
resources must be correctly released on both sides of the connection, host A and host B.

Figure 1 – Simplified TCP closing with FIN.

The standard way to close TCP sessions is to send a FIN packet, then wait for a FIN response from the
other party.

1. A sends a FIN packet and waits for a response; it can release some resources but awaits the
response of the other party (Fin Wait).
2. B receives the FIN packet and must release resources; it waits for its application to close
the connection (Close Wait).
3. B can now send a FIN to A and then await its acknowledgement (Last Ack).
4. A can now finish closing, but it must first wait in case delayed or retransmitted packets are
still in the network (Time Wait); it may have to send the final ACK again.
5. B eventually receives the final ACK and destroys the connection.
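
The closing sequence above maps onto the standard TCP connection states. The table below is a simplified sketch of those transitions (it ignores simultaneous close and RST); the event labels such as `"send FIN"` are hypothetical names used only for this illustration.

```python
# (current state, event) -> next state, for an orderly close.
TRANSITIONS = {
    # Active closer (host A in the steps above)
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",
    ("TIME_WAIT", "timeout"): "CLOSED",
    # Passive closer (host B in the steps above)
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(state, events):
    """Apply events in order and return the final state."""
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"{event!r} is not valid in state {state}") from None
    return state


if __name__ == "__main__":
    print(run("ESTABLISHED", ["send FIN", "recv ACK", "recv FIN", "timeout"]))
    print(run("ESTABLISHED", ["recv FIN", "send FIN", "recv ACK"]))
```

A host that stays in CLOSE_WAIT for a long time is usually waiting for its own application to close the socket, which is often the first thing to check.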

This works fine in a perfect world. However, what happens when one part of the conversation is broken?
That’s why the Reset (RST) packet exists.

Figure 2 – RST sent to force the end of a TCP session.

These abnormal terminations (i.e., either an aborted setup or a disconnection) can appear due to:

• A lack of resources or a network interruption
• A crash/bug during the session
• One peer continuing to send data after the other has already closed its side of the connection
• The server refusing to open a connection to the client

SkyLIGHT™ PVX provides metrics to see if you close TCP connections properly or not. By selecting
the theme “TCP Events“, you get the count of FIN and RST packets in both directions.

Figure 3 – SkyLIGHT PVX showing client IPs sorted by server RST.

If the SYN rate per connection and the server RST count are both high (Figure 3 – 1st row), this means
that the server is refusing the client connection requests. By drilling down to the conversations
(see Figure 4), you will get the precise server ports and applications that cause this issue. Below,
we see some attempts to connect to a VPN from an IP address using port “1194”.

Figure 4 – SkyLIGHT PVX showing the cause of RST without any connections.

For Advanced Users of SkyLIGHT PVX

If you want to filter on data with or without RST or FIN packets, SkyLIGHT PVX provides some custom
filters. In the previous article, we already saw how to filter on connections.

• FIN: fin.count, fin.count.srv, fin.count.clt
• RST: rst.count, rst.count.srv, rst.count.clt

Examples:

• Flows with only RST and no FIN:
rst.count > 0 and fin.count = 0
• Flows with RST but without connections:
rst.count > 0 and ct.count = 0
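
The same two conditions can be applied to per-flow counters in a script. The sketch below is an illustration with a hypothetical function and plain integer arguments; it mirrors the filter logic above but is not SkyLIGHT PVX code.

```python
def close_type(fin_count, rst_count, connections):
    """Describe how a flow ended, based on its FIN, RST and connection counts."""
    if rst_count > 0 and connections == 0:
        return "RST without connection"   # rst.count > 0 and ct.count = 0
    if rst_count > 0 and fin_count == 0:
        return "RST only"                 # rst.count > 0 and fin.count = 0
    if rst_count > 0:
        return "FIN + RST"
    if fin_count >= 2:
        return "double FIN"
    return "incomplete or still open"


if __name__ == "__main__":
    for counters in [(2, 0, 1), (0, 3, 0), (0, 1, 1), (2, 4, 1)]:
        print(counters, "->", close_type(*counters))
```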

Sometimes, RST packets are quite “normal”, for example when the user manually interrupts a huge data
transfer. The server is sending packets as fast as possible, so when the client sends its FIN and
closes its side, the server is still sending lots of data for a moment. The client then
sends RST packets until the server stops sending data. In this case, you will see a client FIN (then
a server FIN) and, in addition, some RST packets.

It is also important to note that some applications do not close sessions properly and simply use an RST
to close every session. While this is not a good practice, you must be aware that some applications are
developed this way.

It can also be relevant to graph some RST or FIN metrics over time. SkyLIGHT PVX provides a
metric to graph the rate of RST per connection over time using Custom Graphs (Figure 5).

Figure 5 – SkyLIGHT PVX custom chart showing TCP RST over time in both directions (from client and
server).

CONCLUSION

We have seen how closing a TCP connection can be more complex than opening one. The session can be
closed by a double FIN, by a mix of FIN + RST, or only by RST packets. However, RST packets can also
be sent without any connection.

SkyLIGHT PVX helps to diagnose session issues by reporting statistics about FIN, RST, SYN, and
connections. It is also able to graph all the metrics over time, especially the RST per connection.

What Causes Network Packet Loss?


The two most common causes of network packet loss are:

• Layer 2 (L2) errors
• Network congestion

If a frame is corrupted on its way from point to point on a connection due to cabling issues, duplex
problems, or other layer 1 events, the receiver will determine that the data is corrupted and drop
it. In most cases, an error counter will be incremented on the interface, which helps when locating
where the loss occurred.

Traffic congestion can cause input/output discards on interface links, especially when translating
between link speeds (10 Gbps to 1 Gbps, for example). On these connections, the egress link may not
be able to keep up with the amount of ingress traffic, which may result in dropped packets. These
drops are typically labelled as “discards” on interfaces. The sender of the traffic will detect that
the loss occurred and retransmit.

As we have seen in this series, TCP is a connection-oriented protocol. Part of the function of
establishing a connection is creating the mechanism to track data that has been sent and acknowledge
what is received. This way, TCP can detect if a packet goes missing and resend it accordingly, ensuring
reliable transmission of data.

Network packet loss: are we still coping with that today?

Yes. Despite the maturity of network links to 10Gbps and beyond, packet loss is still an underlying
network event that impacts applications today. To troubleshoot these issues, we first need to understand
how packets are dropped, how we can detect these events, and how we can resolve them.

TCP Retransmissions

Each byte of data sent in a TCP connection has an associated sequence number. The sequence number
field of the TCP header carries the number of the first data byte in the segment.

When the receiving socket detects an incoming segment of data, it uses the acknowledgement number in
the TCP header to indicate receipt. After sending a packet of data, the sender will start a retransmission
timer of variable length. If it does not receive an acknowledgment before the timer expires, the sender
will assume the segment has been lost and will retransmit it.
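
The “variable length” of the retransmission timer comes from the fact that it is derived from measured round-trip times. The sketch below follows the formulas standardized in RFC 6298; it is not taken from the original article, the class name is hypothetical, and real TCP stacks often use different minimum values.

```python
class RtoEstimator:
    """Retransmission timeout (RTO) estimator using the RFC 6298 formulas."""

    ALPHA = 1 / 8  # gain for the smoothed RTT (SRTT)
    BETA = 1 / 4   # gain for the RTT variation (RTTVAR)
    K = 4

    def __init__(self, granularity=0.001, min_rto=1.0):
        self.granularity = granularity  # clock granularity G, in seconds
        self.min_rto = min_rto          # RFC 6298 recommends a 1 s floor
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                  # initial RTO before any RTT sample

    def update(self, rtt):
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        candidate = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = max(self.min_rto, candidate)
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    for sample in (0.100, 0.120, 0.300):
        print(f"rtt={sample:.3f}s rto={est.update(sample):.3f}s")
```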

TCP header

The TCP retransmission mechanism ensures that data is reliably sent from end to end. If retransmissions
are detected in a TCP connection, it is logical to assume that packet loss has occurred on the network
somewhere between client and server.
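
A retransmission can be spotted in a capture as a data segment whose bytes were already sent earlier. The sketch below is a deliberately simplified illustration: it looks at one direction of one connection, ignores sequence number wraparound, and will also flag reordered packets; the function name and the `(seq, length)` tuples are hypothetical.

```python
def find_retransmissions(segments):
    """Return the (seq, length) segments that repeat already-sent bytes.

    `segments` lists the data segments of one direction in capture order.
    """
    highest_end = None  # highest sequence number sent so far (exclusive)
    repeated = []
    for seq, length in segments:
        end = seq + length
        if length > 0 and highest_end is not None and end <= highest_end:
            repeated.append((seq, length))
        highest_end = end if highest_end is None else max(highest_end, end)
    return repeated


if __name__ == "__main__":
    trace = [(1000, 100), (1100, 100), (1200, 100), (1100, 100), (1300, 100)]
    print(find_retransmissions(trace))
```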

TCP Duplicate / Selective Acknowledgments

Most packet analyzers will indicate a duplicate acknowledgment condition when two ACK packets are
detected with the same ACK numbers.

TCP Duplicate / Selective Acknowledgments

How Do These Happen?

Sending TCP sockets usually transmit data in a series. Rather than sending one segment of data at a time
and waiting for an acknowledgement, transmitting stations will send several packets in succession. If
one of these packets in the stream goes missing, the receiving socket can indicate which packet was lost
using selective acknowledgments.

These allow the receiver to continue to acknowledge incoming data while informing the sender of the
missing packet(s) in the stream.

As shown above, selective acknowledgements will use the ACK number in the TCP header to indicate
which packet was lost. At the same time, in these ACK packets, the receiver can use the SACK option in
the TCP header to show which packets have been successfully received after the point of loss.

The SACK option is a function that is advertised by each station at the beginning of the TCP connection.
Most network analyzers will flag these packets as duplicate acknowledgements because the ACK
number will stay the same until the missing packet is retransmitted, filling the gap in the sequence.

Typically, duplicate acknowledgements mean that one or more packets have been lost in the stream and
the connection is attempting to recover. They are a common symptom of packet loss. In most cases,
once the sender receives three duplicate acknowledgments, it will immediately retransmit the missing
packet instead of waiting for a timer to expire. These are called fast retransmissions.
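
The three-duplicate-ACK rule can be illustrated by counting repeated ACK numbers in a stream. This is a simplified sketch: it only compares ACK numbers and ignores the other conditions a real stack checks (for example that the ACK carries no data), and the function name is hypothetical.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers for which `threshold` duplicate ACKs were seen."""
    triggers = []
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                # The sender would retransmit the segment starting at `ack` here.
                triggers.append(ack)
        else:
            last_ack = ack
            duplicates = 0
    return triggers


if __name__ == "__main__":
    # One original ACK for 2000 followed by three duplicates.
    print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 5000]))
```

Note that three duplicates means four identical ACKs in total: the original plus three repeats.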

Connections with more latency between client and server will typically have more duplicate
acknowledgement packets when a segment is lost. In high latency connections, it is possible to observe
several hundred duplicate acknowledgements for a single lost packet.

Conclusion

If TCP Retransmissions and duplicate acknowledgments are detected on a connection, don’t assume that
the sky is falling and performance has come to a screeching halt. Depending on the network between
endpoints, a small amount of them may be normal.

For example, if a service provider is connecting end users to applications in a data center, or if the
application is hosted in a cloud environment, there are several connections that are beyond the control
and visibility of the network team. End users may perceive performance as normal, but a small number
of retransmissions may exist.

However, when troubleshooting an application performance problem with incrementing retransmissions
for the very users who are complaining, the underlying culprit is likely packet loss. Or at least,
packet loss will be a significant part of the puzzle.

Lost packets require retransmissions, which take time, which will slow applications down. Depending
on how many occur and how fast the endpoints can recover the missing packets, they can significantly
impact application performance.

In these cases, walk the link between client and server, analyzing link-level errors for all
infrastructure devices you control. You may discover a faulty cable, an incrementing Frame Check
Sequence (FCS) error counter, or a discard counter that is contributing to the packet loss.

What is a TCP Receive Window?


Simply put, it is a TCP receive buffer for incoming data that has not been processed yet by the
application.

The size of the TCP Receive Window is communicated to the connection partner using the window size
value field of the TCP header. This field tells the link partner how much data can be sent on the wire
before an acknowledgment is received. If the receiver is not able to process the data as fast as it arrives,
gradually the receive buffer will fill and the TCP window will be reduced in the acknowledgment
packets. This will alert the sender that it needs to reduce the amount of data sent or allow the receiver
time to clear the buffer.

TCP Receive Window

In the above diagram, the client and server are advertising their window size values as they
communicate. Each TCP header will display the most recent window value, which can grow or shrink as
the connection progresses. In this example, the client has a TCP receive window of 65,535 bytes, and
the server has 5,840. For many applications, since clients tend to receive data rather than send it, clients
often have a larger allocated window size. After the handshake, the client sends an HTTP GET request
to the server, which is quickly processed. Two response packets from the server arrive at the client,
which sends an acknowledgment along with an updated window size. The client was able to process the
data packets out of the TCP buffer as fast as they came in, so the window size was not reduced. The
client still has a full window available for receiving data – 65,535 bytes.

In another example, a client is requesting data from a server and begins to receive the data. However, in
this case, the client is not able to quickly process the incoming data. The TCP buffer begins to fill, as
indicated by the reduced window value.

TCP Receive Window and TCP buffer

The acknowledgements from the client indicate that the window is shrinking. As long as the
window value does not fall to zero, this behavior will largely go unnoticed by the end user. Although the
number is slightly reduced, there is still plenty of room in the buffer for data transfer to continue. In
many cases, the client can catch up and will process the data out of the buffer, clearing the window out
and increasing the window value.

TCP Window Scale

The TCP header value allocated for the window size is two bytes long. This means that the highest
possible numeric value for a receive window is 65,535 bytes. In today’s networks, this window size is
not enough to provide optimal traffic flow, especially on long, fat networks (links that have high
bandwidth and high latency). In its native state, TCP cannot take advantage of these high-performance
links since it can have at most 65,535 bytes in flight before it must wait for an acknowledgment.
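
A quick calculation shows why 65,535 bytes is not enough on such links: with one window per round trip, throughput cannot exceed the window size divided by the RTT. The sketch below uses hypothetical function names and example values (100 ms RTT, 1 Gbps link) that are not from the article.

```python
def window_limited_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput (bits per second) for a given window and RTT."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return window_bytes * 8 * 1000 / rtt_ms


def bandwidth_delay_product_bytes(bandwidth_bps, rtt_ms):
    """Window (bytes) needed to keep a link of this bandwidth and RTT full."""
    return bandwidth_bps * rtt_ms / 8000


if __name__ == "__main__":
    # An unscaled 65,535-byte window over a 100 ms path.
    print(window_limited_throughput_bps(65_535, 100) / 1e6, "Mbit/s")
    # Window needed to fill a 1 Gbps link with the same 100 ms RTT.
    print(bandwidth_delay_product_bytes(1_000_000_000, 100) / 1e6, "MB")
```

In this example the unscaled window caps the transfer at roughly 5 Mbit/s, while filling the 1 Gbps link would need a window of about 12.5 MB.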

For this reason, TCP Options were introduced in RFC 1323 that enable the TCP receive window to be
increased exponentially. The specific function is called TCP Window Scaling, which is advertised in
the handshake process. When advertising its window, a client or server will also advertise the scale
factor (multiplier) that will be used for the life of the connection.

TCP Window Size information seen in Wireshark

In the image above, the sender of this packet is advertising a TCP Window of 63,792 bytes and is using
a scaling factor of four. This means that the true window size is 63,792 x 4 (255,168 bytes). Using
window scaling allows endpoints to advertise a window size of up to about 1 GB. To use window scaling,
both sides of the connection must advertise this capability in the handshake process. If one side or
the other cannot support scaling, then neither will use this function. The scale factor, or
multiplier, is only sent in the SYN packets during the handshake and is used for the life of the
connection. This is one reason why it is so important to capture the handshake process when
performing TCP analysis.
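
The scaling arithmetic can be checked with a few lines of code. In the TCP option, the scale is carried as a shift count (a power of two, at most 14), so a multiplier of four corresponds to a shift count of 2. The function name below is hypothetical.

```python
MAX_SHIFT = 14  # largest shift count allowed by the window scale option


def effective_window(advertised, shift_count):
    """Return the real receive window in bytes for a scaled connection."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("the window field is 16 bits wide")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return advertised << shift_count


if __name__ == "__main__":
    print(effective_window(63_792, 2))          # the example from the capture
    print(effective_window(0xFFFF, MAX_SHIFT))  # largest possible window
```

Without the handshake in the capture, the shift count is unknown and only the raw 16-bit value can be read.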

What Is a Zero Window?

When a client (or server – but it is usually the client) advertises a zero value for its window size,
this indicates that the TCP receive buffer is full and it cannot receive any more data. It may have a
stuck processor or be busy with some other task, which can cause the TCP receive buffer to fill. Zero
Windows can also be caused by a problem within the application, where data is not being read from
the TCP buffer.

Example of a TCP Zero Window

A TCP Zero Window from a client will halt the data transmission from the server side, allowing time for
the problem station to clear its buffer. When the client begins to digest the data, it will let the server
know to resume the data flow by sending a TCP Window Update packet. This will advertise an
increased window size and the flow will resume.
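
The stall between a Zero Window and the following Window Update can be measured from the window values advertised by one host. The sketch below is an illustration only; the function name and the `(timestamp, window)` samples are hypothetical.

```python
def zero_window_stalls(samples):
    """Return (start, end) pairs during which the advertised window was zero.

    `samples` holds (timestamp, window) pairs from one host, in time order.
    A stall that is still open at the end of the trace is not reported.
    """
    stalls = []
    start = None
    for timestamp, window in samples:
        if window == 0 and start is None:
            start = timestamp                  # Zero Window advertised
        elif window > 0 and start is not None:
            stalls.append((start, timestamp))  # Window Update received
            start = None
    return stalls


if __name__ == "__main__":
    trace = [(0.0, 65535), (1.0, 0), (1.5, 0), (3.0, 8192), (4.0, 0)]
    for begin, end in zero_window_stalls(trace):
        print(f"stalled for {end - begin:.1f} s")
```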

How Can We Detect TCP Zero Window?



Window problems are usually observed on applications that move a lot of data such as backups, file
transfers, and large downloads. If a performance problem is hampering data transfer, look for window
problems on the receiver.

SkyLIGHT PVX can monitor for Zero Window conditions and displays statistics about which
connections suffered them and when. If these problems are observed in SkyLIGHT PVX, focus on the
station that is advertising the Zero Window value. Remember that this indicates the TCP receive buffer
has been exhausted and data flow will stop until the buffer is cleared. These are usually caused by stuck
processes on the client, under-resourced PCs or an application that is not tuned to receive high rates of
data.

As an example, consider an application where we can observe numerous Zero Window events
generated by 223 clients.

You can easily drill down to the clients involved in the phenomenon and confirm the impact on the data
transfers and End User Response Times:

Top clients by number of Zero Window events

You could also view the evolution through time to understand if it is a continuous or intermittent issue:

Zero Window events trend through time

Why Should You Care About TCP Window Problems and TCP
Windowing in General?
You should care about TCP window problems because they ultimately determine the speed of data
transfers and hence the experience of your users accessing the applications. In this video, you will
learn more about TCP windows in general, TCP Receive windows in particular and discover how they
can impact performance.

A Primer About TCP Windows


The throughput of a communication is limited by two windows: the congestion window and the
receive window. The congestion window tries not to exceed the capacity of the network (congestion
control); the receive window tries not to exceed the capacity of the receiver to process data (flow
control). The receiver may be overwhelmed by data if for example it is very busy (such as a Web
server). Each TCP segment contains the current value of the receive window. If, for example, a
sender receives an ack which acknowledges byte 4000 and specifies a receive window of 10000
(bytes), the sender will not send packets after byte 14000, even if the congestion window allows it.
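
The example above can be written as a one-line rule: the sender may transmit up to the acknowledged byte plus the smaller of the two windows. The function name and the congestion window value of 20000 bytes are hypothetical, chosen only to illustrate the case where the receive window is the limiting factor.

```python
def last_sendable_byte(ack, receive_window, congestion_window):
    """Highest byte the sender may transmit given both windows."""
    return ack + min(receive_window, congestion_window)


if __name__ == "__main__":
    # ACK for byte 4000 with a 10000-byte receive window: limit is byte 14000,
    # even though the congestion window would allow more.
    print(last_sendable_byte(4000, 10000, 20000))
```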

According to Wikipedia,
