
Georg-August-Universität Göttingen, Zentrum für Informatik
ISSN 1612-6793
Number GAUG-ZFI-BSC-2008-05

Bachelor's Thesis
in the degree program "Angewandte Informatik" (Applied Computer Science)

TCP Performance Enhancement in Wireless Environments:
Prototyping in Linux

Swen Weiland

Computer Networks Group (Arbeitsgruppe für Computernetzwerke)

Bachelor's and Master's theses of the Zentrum für Informatik at Georg-August-Universität Göttingen

13 May 2008

Georg-August-Universität Göttingen, Zentrum für Informatik
Lotzestraße 16-18, 37083 Göttingen, Germany

Tel.:   +49 (551) 39-14414
Fax:    +49 (551) 39-14415
Email:  office@informatik.uni-goettingen.de
WWW:    www.informatik.uni-goettingen.de

I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Göttingen, 13 May 2008

Bachelor's Thesis

TCP Performance Enhancement in Wireless Environments: Prototyping in Linux

Swen Weiland

13 May 2008

Supervised by Prof. Dr. Xiaoming Fu, Computer Networks Group (Arbeitsgruppe für Computernetzwerke), Georg-August-Universität Göttingen

Contents

1 Introduction ........ 6
  1.1 Motivation to Optimize Wireless Networks ........ 7
  1.2 Contribution of This Thesis ........ 7
  1.3 Thesis Organization ........ 8
2 Background and Related Work ........ 9
  2.1 TCP Basis ........ 9
  2.2 Existing Works on TCP Improvements in Wireless Networks ........ 10
  2.3 TCP Snoop ........ 12
    2.3.1 Overview ........ 12
    2.3.2 Basic Idea and how Enhancements are achieved ........ 13
3 Proxy Implementation Design ........ 15
  3.1 Overview and a Brief Function Description ........ 15
  3.2 Interface to the Kernel for capturing ........ 19
  3.3 Capturing Packets - a Comparison of Methods for Capturing ........ 20
    3.3.1 libPCap ........ 20
    3.3.2 RAW Sockets ........ 22
    3.3.3 Netfilter Queue ........ 24
    3.3.4 Kernel Ethernet Bridge ........ 24
      3.3.4.1 Netfilter ........ 25
    3.3.5 Conclusion of the Comparison ........ 26
  3.4 Module Design of the Proxy Implementation ........ 28
    3.4.1 Operating System requirements ........ 28
    3.4.2 Module: Buffer Manager ........ 28
    3.4.3 Module: Netfilter Queue Interface ........ 31
    3.4.4 Module: TCP Connection Tracker ........ 34
      3.4.4.1 Prime Numbers for the Secondary Hash Function ........ 35
      3.4.4.2 Identify a TCP flow ........ 36
      3.4.4.3 Retransmission Buffer ........ 39
    3.4.5 Module: Timer Manager ........ 40
    3.4.6 Module: Connection Manager ........ 42
      3.4.6.1 Stateful tracking ........ 43
      3.4.6.2 Round Trip Time calculation ........ 45
4 Evaluation ........ 48
  4.1 Testing ........ 48
    4.1.1 TCP Connection Tracker ........ 49
    4.1.2 Connection Manager with implemented TCP Snoop behavior ........ 50
  4.2 Performance Evaluation ........ 52
5 Conclusions ........ 55
  5.1 Summarization of Results ........ 55
  5.2 Future Work and Outlook ........ 57
Bibliography ........ 58

List of Figures

2.1 TCP Header ........ 10
3.1 Overview: Implemented modules and their communication ........ 16
3.2 Dataflow with libPCap ........ 21
3.3 Dataflow with RAW Sockets ........ 23
3.4 Ethernet Type II Frame format ........ 24
3.5 Dataflow with Netfilter Queue ........ 26
3.6 Logical structure: Linked list used as a FIFO ........ 29
3.7 Implementation: Linked list at initial state ........ 30
3.8 Implementation: Linked list after retrieval of a chunk ........ 30
3.9 Implementation: Linked list after returning the chunk ........ 31
3.10 TCP Handshake ........ 38
3.11 Simplified TCP State Machine Diagram from the Connection Tracker ........ 47
4.1 Testbed for initial TCP connection tracking test ........ 49
4.2 Testbed for TCP connection tracking test ........ 50
4.3 Testbed for TCP Snoop ........ 51

Abstract

In recent years, wireless communication has become more and more popular. Future wireless standards will reach throughputs much higher than 100 Mbit/sec on the link layer. However, wireless channels, compared to wired lines, exhibit different characteristics due to fading, interference, and so on. For the Transmission Control Protocol (TCP), the misinterpretation of packet loss caused by wireless channel characteristics as network congestion results in suboptimal performance. There are many different approaches to enhance TCP over wireless networks, especially for slow and lossy links such as satellite connections. This thesis evaluates "TCP Snoop" as one of these approaches for high transfer rates. Finding, using and implementing effective capturing, buffering and tracking of TCP communication were the objectives to solve. A general and transparent TCP proxy with "TCP Snoop" behavior was implemented during the work for this thesis. The TCP proxy runs on an intermediate Linux host which connects wired and wireless networks, as a prototype user space application with a modular design.

Different traffic capture methods are compared in terms of portability and performance. A full TCP connection tracking is described and implemented. Design patterns and methods that proved their benefit in practice were applied and sometimes partially modified to fit the needs of the transparent TCP proxy. The modular design makes it possible to exchange a low-level module such as the data traffic capture module. Porting the implementation to another operating system or another platform, like embedded systems which are used as wireless LAN routers, or changing the TCP enhancement method are also eased by the modular design.

The results show that a transparent TCP proxy or any other traffic-modifying implementation should not reside in user space for performance reasons. A kernel space implementation, or even better dedicated hardware like a network processor platform, should be used for such implementations.

1 Introduction

Nowadays, the Internet is more and more based on wireless technologies, especially for the last few meters or even the last few miles. For example, many of the DSL Internet connections in Germany are bundled with subsidized wireless access points in their product packages. From my personal experience, people like to sit on the sofa or in the garden outside the house while they are still using the Internet for surfing, chatting and downloading. In such scenarios, wireless is becoming more and more desirable.

According to the traffic analysis study by Simon Leinen at Columbia, the majority of Internet traffic, about 74% [Leinen], is TCP. An extract of the results from his traffic analysis is presented at the bottom of this page as Table 1.1. As TCP was originally designed for wired communication, there are some drawbacks in wireless scenarios. If these issues could be solved, or at least optimized, this would also optimize the majority of Internet traffic. This thesis focuses on the TCP performance enhancement issues over wireless environments. More specifically, I performed a prototype implementation of a transparent TCP proxy as a user space application for optimizing end-to-end TCP performance. The implementation is meant to run on, or very near to, the last hop to a mobile node. A user space application benefits from a straight design and is independent of any restrictions that apply to implementing a kernel module. The restricted privileges of a user space application protect the operating system and lead to good system stability. Moreover, debugging is easier in user space, which simplifies rapid prototyping.

Protocol   Flows        Flows (%)   Packets        Packets (%)   Bytes            Bytes (%)
GRE        383          0.00 %      17235          0.00 %        3602115          0.00 %
ICMP       101931237    1.75 %      305793711      0.45 %        37918420164      0.11 %
IGMP       34662        0.00 %      901212         0.00 %        58578780         0.00 %
IP         1406788      0.02 %      15474668       0.02 %        3528224304       0.01 %
IPINIP     1297         0.00 %      1297           0.00 %        583650           0.00 %
TCP        4361852662   74.91 %     63919315234    93.39 %       32455859980970   96.82 %
UDP        1357265629   23.31 %     4201556174     6.14 %        1025284993546    3.06 %

Table 1.1: Analysis taken by Simon Leinen on an access router at a random university [Leinen].


1.1 Motivation to Optimize Wireless Networks

In TCP, packet losses are interpreted by the TCP stack as congestion by default. For wired network hardware today, packet loss is not a real problem any more because of the very low bit error rate. However, wireless and mobile networks are often characterized by sporadically high bit-error rates, intermittent connectivity and interference effects [Caceres]. This results in higher bit error rates than in wired networks. Additionally, a sender lowers its packet sending rate due to the misinterpretation of packet losses as congestion.

To avoid this, some researchers propose various TCP enhancements [Bakre], [Balakrishnan1995], [Chen] which try to reduce or eliminate these impacts. To apply such an enhancement to a network, in most cases the infrastructure has to be changed or the nodes have to be reconfigured. If a transparent TCP proxy is used, nothing in the infrastructure has to be changed; only the proxy function is added.

The positioning of such a proxy should be as near as possible to the wireless part, in order to react more quickly to changes in the wireless part of the network. Directly in a base station of a wireless network is the nearest and therefore best position. I assume it has sufficient RAM and processing power to perform the necessary proxy functions, which is plausible given today's manufacturing technologies, and I will come back to this issue in the evaluation in the later parts of this thesis.

1.2 Contribution of This Thesis

The contribution of this thesis can be summarized as follows:

Representative approaches for TCP enhancements over wireless environments are identified and classified into three groups.

A transparent proxy approach, TCP Snoop [Balakrishnan1995], [Balakrishnan1996], [Balakrishnan1999], is selected for primary study due to its nice tradeoff between functionality and complexity. A software design over Linux is presented and implemented.

A key technique used in transparent proxy approaches, namely data capturing, is identified. A data capturing solution is chosen out of several alternative solutions, according to their ability and performance to capture and modify the through-passing traffic.

Finally, a performance analysis of the implementation is given and the TCP Snoop approach as a user space application is evaluated systematically.


1.3 Thesis Organization

The thesis starts with a general survey of related work, including an introduction to "TCP Snoop" as the background for the software design and implementation. I then discuss the design of the software framework:

First, requirements and dependencies on other software are introduced, followed by a bottom-up overview of the general design of the software with a short functional description. After deciding on the most efficient capture method for this implementation, the thesis describes how each specific module was designed and implemented, which design patterns were applied, and why the implemented algorithms were chosen. Note that source code is not discussed directly; only necessary function descriptions or data structures are given in order to provide a closer view of the software. The implementation is then evaluated with some test cases and analyzed in terms of performance.

For the convenience of the readers, below I give a short summary of the notation I use in this thesis:

Bibliographic sources are given as references in brackets.

Shell commands and source code are shown in framed boxes.

Bold words represent references to labels in figures and/or names of software modules. Italic words represent important expressions or the project name of a software implementation that was used or compared against.

References to the Internet are inserted directly into the text as footnotes.


2 Background and Related Work

2.1 TCP Basis

Basic knowledge about the Internet Protocol (IP) and the Transmission Control Protocol (TCP) is assumed. In this thesis the focus is on TCP, and in this paragraph some facts about TCP are recapitulated which are referred to later or explained later in more detail. If you have good knowledge about TCP, the rest of this section can be skipped and you can continue reading at Section 2.2.

TCP is a reliable stream delivery service that guarantees to deliver a stream of data sent from one node to another without duplication or loss of data. It also provides flow control and congestion control. A flow or connection is identified by a source and destination IP address and especially by the source and destination TCP port number. Reliability is achieved with acknowledgment messages, which are sent along with data packets or standalone with an empty data packet. These acknowledgment messages are represented as special fields in the TCP protocol header. To be precise, the acknowledgment number (ACK number) and the acknowledgment flag (ACK flag) are meant.

The structure of the TCP protocol header is shown in the following Figure 2.1 on page 10. The position of the ACK number and the ACK flag can be looked up in the figure. By comparing the sequence number of already sent packets with the acknowledgment number of a currently received packet, the TCP stack can decide which packets have reached their destination, are still on the way, or are lost. All sent-out packets in the outgoing buffer with a sequence number lower than the acknowledgment number of a received packet have reached their destination. Every time a packet or a bunch of packets is sent out, a retransmission timer is started.

A loss event is a timeout of this timer or 3 duplicate ACKs.

"The fast retransmit algorithm uses the arrival of 3 duplicate ACKs (4 identical ACKs without the arrival of any other intervening packets) as an indication that a segment has been lost." [RFC2581, paragraph 3.2 on page 6]


Figure 2.1: TCP Header (fields: source port, destination port, sequence number, acknowledgment number, data offset, reserved, flags URG/ACK/PSH/RST/SYN/FIN, window, checksum, urgent pointer, options of 0 or more 32-bit words, data)

Loss events are the most important information for a TCP proxy, because reacting to and resolving these loss events is one of its main tasks. Later, in 3.4.4, more parts of the TCP header are addressed, and Figure 2.1 can be used as a reference.
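To make the two loss events concrete, the following is a minimal C sketch of how a connection tracker might count duplicate ACKs; the structure and function names are illustrative assumptions, not code from the thesis.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-flow state (names are hypothetical). */
struct flow_state {
    uint32_t last_ack;      /* highest ACK number seen so far          */
    unsigned dup_acks;      /* identical ACKs received after last_ack  */
};

/* Returns true if this ACK completes the "3 duplicate ACKs" loss event
 * quoted above from RFC 2581; sequence number wraparound is ignored
 * here for brevity. */
static bool ack_signals_loss(struct flow_state *fs, uint32_t ack,
                             uint16_t payload_len)
{
    if (payload_len == 0 && ack == fs->last_ack) {
        if (++fs->dup_acks >= 3)
            return true;              /* loss event: e.g. retransmit locally */
    } else if (ack > fs->last_ack) {  /* new data acknowledged */
        fs->last_ack = ack;
        fs->dup_acks = 0;
    }
    return false;
}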

2.2 Existing Works on TCP Improvements in Wireless Networks

This section gives a brief overview of related work on mechanisms for improving TCP performance over wireless links. According to the work of Balakrishnan et al. [Balakrishnan1996] and Xiang Chen et al. [Chen], these mechanisms can be grouped into:

End-To-End: End-To-End schemes apply enhancements directly in the TCP stack or extend it with TCP options. This implies a modification of the TCP stack, which is mandatory for both communication partners if they want to benefit from this type of enhancement. Examples are TCP-NewReno or Explicit Loss Notification (ELN) [Balakrishnan1998]. The behavior of the TCP stack for loss events is optimized, or additional information is sent out and processed to realize a better differentiation between congestion and packet loss caused by link layer errors.


Link-Layer: Link-Layer approaches try to make the link layer aware of higher layer protocols like TCP. No TCP stack modification on the communicating nodes is required, but an intermediate node is added to the network infrastructure. An example of this group is TCP Snoop. Link layer errors are corrected by enhanced retransmission techniques, and because of the transport layer awareness these errors can be hidden. Misinterpretations by the transport layer that congestion occurred instead of link layer errors are suppressed.

Split-Connection: As the name suggests, a flow of a connection-oriented and reliable transport layer protocol, e.g. TCP, is split at an intermediate node into two separate flows. Therefore the intermediate node needs to maintain two separate TCP stacks, but data copying between these stacks is avoided by passing only pointers and having a shared buffer. Examples for split connection schemes are I-TCP and SPLIT.

All mechanisms are very similar, because they fight against the same issues with slightly different methods and effort. Link layer transmission errors are detected, and misinterpretations that congestion occurred are avoided. Asymmetric network links are handled more efficiently, and additionally some of these mechanisms do caching and local retransmission to recover more efficiently from packet losses.

Modifying the TCP stack as the End-To-End approaches do is very effective, because wrong or ineffective behaviors are suppressed directly at their source. The additional processing caused by this type of modification is often very small. On the other hand, these approaches are hard to deploy, because in most cases both end-nodes need to implement the modifications to benefit from them, and in some cases middle-nodes are involved as well. Link layer feedback is limited and issues are mainly fixed on the transport layer. For Link-Layer or Split-Connection approaches only an intermediate node or module is added to the network infrastructure, and no end-node has to be modified. This means they are easier to deploy, but adding an additional node also raises the processing power needed for communication. An intermediate node can only guess the state of a TCP stack and can only influence it indirectly. Split-Connection approaches try to solve the indirect influence issue by maintaining two separate TCP stacks for each flow on the intermediate node, but this also nearly doubles the processing overhead. Link-Layer approaches are a good tradeoff, because they are easier to deploy and less complex than Split-Connection approaches. Furthermore, Link-Layer approaches do not break the end-to-end communication like Split-Connection approaches do, which makes roaming in wireless networks possible without any trouble.


2.3 TCP Snoop

2.3.1 Overview

As defined by Balakrishnan et al. [Balakrishnan1995], TCP Snoop seeks to improve the performance of the TCP protocol for wireless environments without changing the existing TCP implementations, neither in the wired network nor in the wireless network. It is designed to be used on or near a base station of a wireless network to enhance TCP end-to-end connections. Every through-passing TCP packet is buffered in a local memory for doing a fast and local retransmission if a packet loss occurs on the wireless part of the network. TCP Snoop behaves as a transparent proxy and maintains a cache of TCP packets for each TCP flow. Lost packets are detected and locally retransmitted. Therefore each flow is tracked with an additional but simplified TCP stack in the proxy implementation. TCP Snoop does not break end-to-end connectivity like Split-Connection approaches and stays completely transparent. This makes roaming from one wireless part to another wireless part of the same network possible.

There are several reasons for treating wireless networks differently for TCP enhancement. Wired network links are very reliable and generally offer higher bandwidth than wireless links. Reasons for this are the physical medium access method Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) and the medium itself. The wireless medium is more vulnerable to physical interference and it is only a half duplex medium. In contrast, wired communication is mainly used as a full duplex medium and can easily be shielded as protection against physical interference.

Triggered by lost packets or duplicate ACKs, which are also used for signaling lost packets, a TCP stack may suspect a congested link and lowers its sending rate. This leads to a suboptimal usage of bandwidth in wireless networks, because congestion in standard TCP is only detected by losses, but in wireless networks there are many reasons for losses. Wireless links can recover to a higher bandwidth very quickly if an interference stops or weakens, but the TCP stack detects this very slowly compared with detecting congestion. There is simply no signaling for these bandwidth changes in the standard TCP stack. To avoid this, duplicate ACKs are suppressed by the TCP Snoop proxy and a local retransmission is triggered for every lost packet. For wired links the congestion assumption is normally the right choice, because of the lack of temporarily high bit error rates in the medium. With high probability, wired links have reached their current maximum bandwidth if packet losses occur, which means congestion.


2.3.2 Basic Idea and how Enhancements are achieved

TCP Snoop is implemented as a transparent TCP proxy near to or on a base station of a wireless network. A TCP proxy has two Ethernet devices and forwards all traffic from one device to the other. On the way from one device to the other, a packet can be modified by the proxy if necessary, and this modification is also transparent for the network. Non-TCP traffic is ignored and just forwarded, but TCP traffic is processed by the proxy before it is forwarded as well. Processing means identifying each TCP flow and tracking it. If a packet loss is detected during the tracking, a local retransmission is done. Losses are detected by a certain amount of duplicate ACKs and by timeouts of a locally installed retransmission timer, which is part of a simplified TCP stack in the proxy. The simplified TCP stack is utilized for tracking TCP and tracks SEQ numbers, ACK numbers and other dynamic values of a previously identified TCP flow.

The proxy should be placed as near as possible to the wireless base station in the network topology to reduce response times. All unacknowledged packets from the fixed host (FH) to the mobile host (MH) are cached in the buffer of the proxy. This should be a buffer in a fast memory like DRAM or SRAM. Unnecessary invocations of congestion control mechanisms for a TCP flow are avoided by hiding duplicate ACKs and doing the local retransmission.

The authors of the original paper that defined TCP Snoop updated it [Balakrishnan1996], [Balakrishnan1999] to improve the performance for packet losses on the way from the MH to the wireless base station. The update is implemented by sending selective retransmission requests to the MH and can be described as follows. Packets from the MH to the FH are processed and cached as normal, but if a gap in the ascending sequence numbers is detected, a Negative Acknowledgment (NACK 1) is sent from the proxy to the MH, which triggers a retransmission. A new copy of the lost packet, which the TCP proxy can cache, should then already be on the way by the time the FH realizes that it was lost.

Native TCP supports only positive acknowledgment of packets. The Selective Acknowledgment (SACK) is a TCP option which allows sending selective Acknowledgments (ACKs) or selective Negative Acknowledgments (NACKs) for specific packets. A TCP option defines an extension to TCP, which is sent in an optional part of the TCP header.

To verify that using this TCP SACK option and the proposed updates are applicable, I investigated the distribution of SACK support. From December 1998 to February 2000, the fraction of hosts sampled with SACK-capable TCP increased from 8% to 40% [Allman]. Today it should be 90% and above, because SACK is supported by every major operating system.

1 Part of the Selective Acknowledgment (SACK) TCP option. Standardized by RFC 2018.


To give some names, SACK is supported by Windows (since Windows 98), Linux (since 2.2), Solaris, IRIX, OpenBSD and AIX [Floyd].

The host support of the SACK extension is detected by the proxy during the three-way handshake which establishes a TCP flow. If SACK is not supported by one of the two hosts, it cannot be used for this flow at all. This is specified by RFC 2018, which defines the TCP SACK option. In this case the proxy skips the enhancement for the traffic from the MH to the FH and just tries to enhance the traffic from the FH to the MH.


3 Proxy Implementation Design

3.1 Overview and a Brief Function Description

The TCP proxy prototype implements a transparent TCP enhancer with a modular design for easier extension and the possibility to implement other TCP enhancements in future work. For enhancing the TCP traffic, the proxy must have the ability to drop or modify through-passing TCP packets. To achieve this, the TCP proxy breaks the physical medium and puts itself in between (see Figure 4.3) as an intermediate node. This gives the TCP proxy total control over which packet is forwarded, modified or dropped, because it has to forward each packet from one interface to the other.

The following Figure 3.1 on page 16 gives an overview of the general software design of the proxy and its core modules. Details of each module and all applied design patterns or used algorithms are described later in this chapter in 3.4.

For the implementation the C programming language was used. Multi-threading, thread synchronization and mutual exclusions (mutexes 1) were implemented using POSIX Threads. POSIX Threads is a POSIX standard for threads and defines a standard API for creating and manipulating threads. Mutexes are needed to secure and order asynchronous write operations. The POSIX thread implementation is platform independent, well standardized, well documented, and usable as a simple library or integrated into the operating system.
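The thesis does not reproduce this setup code; as a minimal sketch only, a capture thread and a mutex protecting shared flow state could be wired up with POSIX Threads roughly as follows (all names are illustrative assumptions).

#include <pthread.h>
#include <stddef.h>

/* Mutex that orders asynchronous writes to the shared flow state. */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical capture thread body: blocks on the capture interface and
 * updates shared per-flow data under the mutex. */
static void *capture_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* ... wait for the next packet (blocking call) ... */
        pthread_mutex_lock(&flow_lock);
        /* ... update the shared TCP flow state ... */
        pthread_mutex_unlock(&flow_lock);
    }
    return NULL;
}

/* Started once by the main thread during initialization. */
int start_capture_thread(void)
{
    pthread_t tid;
    return pthread_create(&tid, NULL, capture_thread, NULL);
}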

The main target platform is Linux with kernel version 2.6.14. This is dictated by the need to use libnetfilter_queue 2 as the interface to the kernel, which is used for capturing, filtering and modifying TCP packets. This extension of the Netfilter packet filter requires this version of the Linux kernel. Netfilter is the standard packet filter of the Linux operating system. All functionality needed to use this interface is implemented in the Netfilter Queue Interface (see Figure 3.1) module of the proxy.

To support older kernel versions, which are often used on embedded systems like Wireless LAN (WLAN) routers, only this Netfilter Queue Interface module has to be replaced.

1 http://en.wikipedia.org/wiki/Mutual_exclusion
2 Netfilter Queue - see 3.3.3 for further description


Figure 3.1: Overview: Implemented modules and their communication (Ethernet Devices #0 and #1, Kernel Ethernet Bridge device, Netfilter Queue Interface, TCP Connection Tracker, Connection Manager with TCP Snoop, Buffer Manager, Timer Manager and Packet Generator/Manipulator; retransmitted and newly created TCP packets are sent out via a RAW Socket, free buffer chunks are provided by the Buffer Manager, and an acknowledgment for forwarding is returned to the Netfilter Queue Interface)

The replacement would be some other module which does the capturing and filtering of the TCP traffic. Such a module could use a RAW Socket 3 for capturing and a generic protocol classifier to detect the TCP traffic. A back-port to earlier versions of the Netfilter framework would also be possible with some, but minimal, effort. Back-porting to some embedded router could be done in future work if needed. The modular design makes this possible and easier to achieve.

The Netfilter interface was chosen over the other capturing methods for performance reasons, which is described and analyzed later in 3.3. A proxy instance consists of three threads. The main thread, which is created by the operating system, is the first thread. After some basic initialization the main thread installs an operating system callback function which triggers the Timer Manager module. This callback function is called at a fixed interval.

3 RAW-Socket - see 3.3.2 for further description


It is counted as a separate thread because actions can be triggered or data can be manipulated asynchronously to the main thread. The third thread is also created by the main thread and is used for capturing. The capturing thread resides in the Netfilter Queue Interface module.

Retransmission is implemented by utilizing RAW Sockets. These allow creating and sending custom TCP packets. Every field in the IP and TCP header can be set to a custom value. In the case of the TCP proxy, mainly a previously buffered packet is just retransmitted.

To do the retransmission, the proxy must be aware of each TCP flow and its state. Only with this information does the proxy know the point in time and which TCP packet has to be retransmitted. After some TCP traffic has been captured by the Netfilter Queue Interface module, it is handed over to the TCP Connection Tracker module. The TCP Connection Tracker module gathers the following information:

Source and destination IP address

Source and destination TCP port

Packet is stored in a queue (one for each direction per connection)

Pointer to the corresponding management structure

After the TCP flow is identified by the TCP Connection Tracker and the corresponding management structure is known or created, all this information is passed to the Connection Manager module, which adds the following information to the management structure (a sketch of such a combined structure follows the list):

Connection Status

State (TCP State-Machine)

Acknowledge Number (ACK)

Sequence Number (SEQ)

Present TCP options

Round Trip Time (RTT)

Timer (for retransmission)
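As a plausible illustration only (the thesis does not show the corresponding source code, and all type and field names below are assumptions), the two lists above could be combined into one management structure per flow like this:

#include <stdint.h>

struct chunk;   /* buffered packet, see the Buffer Manager in 3.4.2 */
struct timer;   /* handle managed by the Timer Manager, see 3.4.5   */

/* Hypothetical per-flow management structure. */
struct tcp_flow {
    /* gathered by the TCP Connection Tracker */
    uint32_t src_ip, dst_ip;       /* source and destination IP address  */
    uint16_t src_port, dst_port;   /* source and destination TCP port    */
    struct chunk *queue[2];        /* packet queue, one per direction    */

    /* added by the Connection Manager */
    int      status;               /* connection status                  */
    int      state;                /* TCP state machine state            */
    uint32_t ack_num;              /* acknowledgment number (ACK)        */
    uint32_t seq_num;              /* sequence number (SEQ)              */
    uint32_t options;              /* present TCP options (e.g. SACK)    */
    uint32_t rtt;                  /* round trip time estimate           */
    struct timer *retrans_timer;   /* retransmission timer               */
};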


The Connection Manager is the main module; it makes all important decisions and can be seen as a concentrator for all the information. TCP packets can be forwarded, dropped, modified or retransmitted. New packets can also be created. As shown in Figure 3.1, the TCP Snoop behavior is implemented in this module.

There are also three helper modules: Buffer Manager, Timer Manager and Packet Generator/Manipulator.

The Buffer Manager offers some management functions for the buffer memory of the proxy. A big memory block is reserved at the initialization of this module and divided into small chunks. Some other module, mainly the Netfilter Queue Interface for capturing and storing new data, can retrieve an unused chunk from the Buffer Manager. If a chunk is not needed any more, it is returned to the Buffer Manager. Different chunk managements are implemented. They are used during capturing, as described before, and for the queue management per TCP flow. The TCP packets for each tracked TCP flow are buffered in a special queue. There is one queue per communication direction of the flow.

The Timer Manager is used by the Connection Manager to install the retransmission timers and to install special timeout timers for some TCP states. It offers other modules the possibility to trigger a specific action after a specified time period. This Timer Manager module is necessary because an application thread can only install one timer callback with only one fixed interval. One callback would not be enough if the Connection Manager wants to install at least one retransmission timer for each TCP flow. Therefore the Timer Manager handles and manages this one callback for every module that wants to install one timer or as many timers as it wants. Repeating and one-time timers are possible. The callback is realized and handled with the signal handling of the operating system (a minimal sketch follows after this list). Another way to implement it would be busy waiting in a separate thread, but this is definitely a bad choice; busy waiting is in most cases a bad choice for software.

The Packet Generator/Manipulator is only used by the Connection Manager to create new packets or modify packets, for example for hiding duplicate ACKs or sending a NACK for a specific packet to the original sender.
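As a minimal sketch of the single operating system callback that the Timer Manager multiplexes, the following uses setitimer() and SIGALRM; the thesis only states that OS signal handling at a fixed interval is used, so this exact mechanism and all names are assumptions.

#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* The one callback installed by the main thread; the Timer Manager would
 * walk its own list of installed timers here and fire those that expired. */
static void timer_tick(int signo)
{
    (void)signo;
    /* ... check retransmission and TCP state timeout timers ... */
}

/* Install a periodic SIGALRM with the given interval in milliseconds. */
int timer_manager_init(long interval_ms)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timer_tick;
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    it.it_interval.tv_sec  = interval_ms / 1000;
    it.it_interval.tv_usec = (interval_ms % 1000) * 1000;
    it.it_value = it.it_interval;          /* first tick after one interval */
    return setitimer(ITIMER_REAL, &it, NULL);
}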


3.2 Interface to the Kernel for capturing

The Linux operating system segregates its memory into kernel space and user space. Kernel space is privileged and strictly reserved for the kernel and device drivers. A normal application runs in user space and has no direct access to kernel space memory areas.

If a TCP packet arrives, it is pushed as an Ethernet frame from the device driver of the Ethernet card to the Ethernet stack in kernel space. From there it is passed to the TCP/IP stack in the kernel if it was identified as IP traffic; we assume this as the example. To send this packet to an application running in user space, it has to be copied to a memory area in user space. This has to be done because kernel space memory is not directly accessible from user space. When passing data from the kernel to a user space application, nothing more happens than duplicating a memory area into user space. Even if the packet is not needed any more after duplication, the data still has to be copied to an address space assigned to the user space. This effort has to be made for security reasons; just remapping the kernel space memory block into user space is not possible. Copying memory and throwing one of the copies away is an expensive operation. How to reduce the number of copy operations is shown in the next section, 3.3. Via the TCP socket, a kind of interface library, the application passes a pointer to a memory area in user space for the incoming packet to the kernel. The kernel duplicates the buffer with the incoming packet into the user space memory addressed by the pointer.

Normally the Ethernet card passes only Ethernet frames which are addressed to its Media Access Control address (MAC address) to the device driver. But the card can be set into a special mode. This mode is called "promiscuous mode" and makes the Ethernet card pass all Ethernet frames on the wire to the device driver. In this mode the kernel has to process all the traffic on the wire: the traffic which is destined to its own IP address and the traffic to any other host in the Ethernet subnet. The additional traffic to other hosts caused by the "promiscuous mode" is dropped by the Ethernet stack or later in the IP stack of the kernel. Only traffic to the own host which is addressed to a corresponding application on the host is passed to user space.

To get a copy of all traffic into user space, a special kernel interface is needed, especially for the traffic which is destined to other hosts. With such an interface it is possible to grab a copy of the traffic before it reaches the Ethernet stack or before it reaches the IP stack. This means Ethernet frames or IP packets can be grabbed and pulled into user space. Such an interface can be a RAW Socket (described in 3.3.2) or a special kernel module like Netfilter (described in 3.3.3). Using such an interface is usually known as "sniffing" or "capturing".


As we now know, passing traffic to user space is expensive, but passing all traffic from the wire to user space is very expensive! The "promiscuous mode" also leads to much more work for the kernel, because the traffic which is not destined to the own host is processed as well. On slow machines this can easily lead to performance problems.

3.3 Capturing Packets - a Comparison of Methods for Capturing

In this section different methods for capturing are shown and compared with a focus on performance and usability for the TCP proxy. The selected methods were chosen because of their popularity and because they have been present and/or usable for at least 5 years. This should ensure that a prototype based on such an interface or library will be usable with newer versions of operating systems and/or these interfaces. Other proprietary kernel modules, which only work for a few Linux kernel versions, were ignored.

3.3.1 libPCap

The Packet Capture library (libPCap 4) provides a multi-platform and high-level interface for packet capturing on the link layer. It puts the Ethernet device into "promiscuous mode" and supports many common operating systems like different Linux, Windows, MacOS and BSD versions. The interface to the library is well documented, and packet capturing can be implemented within a few lines of source code.

Early versions of my TCP proxy used libPCap for capturing because of the easy handling and the support for many platforms. The library itself uses RAW Sockets (3.3.2) for capturing Ethernet frames and only abstracts the usage of RAW Sockets on the different platforms. Additionally it adds a buffer management and a filter management. The application does not have to care about the handling of an incoming buffer, and buffer overflows are also handled by the library. Just the buffer size and a callback function for handling incoming frames need to be defined during the initialization.

4 http://www.tcpdump.org/


All the traffic from the wire is stored in the buffer and then filtered. If no predefined filter applies, the library calls the previously defined callback function. In this function the Ethernet frame has to be processed by the application. After the callback function returns control to the library, the memory with the Ethernet frame is freed and reused for capturing. Freeing the memory after the callback function is the main disadvantage for implementing the proxy. If the proxy wants to retransmit a packet, the frame with the packet has to be copied during the callback to another buffer (see (5) in Figure 3.2). If not, the packet would be lost for the proxy after the callback and could not be retransmitted if this were needed.

Figure 3.2: Dataflow with libPCap ((1)-(3): all traffic is copied from the kernel into the libPCap buffer in user space and handed to the proxy's callback function; (4): all traffic is copied back to kernel space via a RAW Socket to the other Ethernet device; (5): only TCP packets are copied into the retransmission buffer of the proxy)

The second disadvantage is that the filtering of libPCap (which would happen before (3)) is done in user space. Using the internal filtering of libPCap would reduce the amount of data that needs to be processed by the proxy and would therefore be an optimization. On the other hand, the proxy needs to bridge the two Ethernet devices to be transparent, and to achieve this, all Ethernet frames from one device have to be sent out via RAW Sockets on the other device. The filter management of libPCap is therefore useless, because all frames are needed for bridging the devices. In order not to break an ongoing communication, none of them may be filtered out on the way from one Ethernet device to the other.

To buffer the TCP traffic, the proxy has to filter the bridged traffic for TCP packets, so filter functionality has to be implemented in the proxy. As shown in Figure 3.2, all traffic takes the way (1), (2), (3) from kernel to user space into the libPCap buffer and to the callback of the proxy. Further, the traffic goes the way back via (4) into kernel space via a RAW Socket to the other Ethernet device. Only the TCP packets are duplicated into the retransmission buffer of the proxy, which is symbolized by arrow (5).
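To illustrate the capture path just described, here is a minimal, self-contained libPCap capture loop. It is a sketch only; the device name, snapshot length and timeout are assumptions, and it merely prints frame sizes instead of implementing the proxy callback.

#include <pcap.h>
#include <stdio.h>

/* libPCap callback: the frame memory is only valid until this function
 * returns, so a proxy would have to copy any packet it may want to
 * retransmit later (arrow (5) in Figure 3.2). */
static void handle_frame(u_char *user, const struct pcap_pkthdr *hdr,
                         const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured frame, %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* open the device in promiscuous mode with a 100 ms read timeout */
    pcap_t *p = pcap_open_live("eth0", 65535, 1, 100, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }
    pcap_loop(p, -1, handle_frame, NULL);   /* capture until interrupted */
    pcap_close(p);
    return 0;
}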

3.3.2 RAW Sockets

A RAW socket is a socket that can be seen as a direct interface to the transport or network layer. It passes a copy of a frame or packet directly to user space, before it is processed by the Ethernet stack or IP stack. It is also possible to send data like a TCP packet directly to the wire without it being processed, or with it being only partially processed, by the IP stack. Partially means, for example, calculating the checksums in the headers. This is a very good possibility to implement retransmission for the proxy, but let's focus back on capturing.

The current RAW socket interface has been supported by the Linux kernel since version 2.2.x. In version 2.0.x there is a very similar interface, but it is obsolete and deprecated now. Using RAW sockets for capturing was tested with kernel version 2.4.x on a Linksys WRT54G 5, an embedded Linux router with a wireless interface, and on a PC with Linux kernel version 2.6.x. Normally a RAW Socket gets only the traffic that is destined to the MAC addresses of the Ethernet devices owned by the proxy host. To implement the proxy functionality, all traffic on the wire needs to be processed by the proxy. Therefore the Ethernet devices have to be set into "promiscuous mode" as described in 3.2. Basically the data flow is very similar to an implementation with libPCap (3.3.1), but there are no restrictions imposed by a predefined library, like the limited control over the buffer management. On the other hand, everything such as protocol classification and buffering has to be implemented by the proxy, which causes more processing in the proxy itself and therefore more workload during development.

A proper design would be multi-threaded and provide at least two threads: one capture thread for each Ethernet device. This prevents polling each device alternately. Polling is a type of busy waiting, which should not be used in software. With multiple threads, the access to shared information like the state or the presence of a TCP flow has to be managed and protected, therefore mutexes are used. Multi-threading and mutexes have been implemented since the early prototypes with the very handy POSIX Threads library libpthread.

5 see products on http://www.linksys.com


Figure 3.3: Dataflow with RAW Sockets ((1), (2): all traffic is copied from kernel space with a RAW Socket directly into the buffer of the proxy in user space; (3): all traffic is sent out via a RAW Socket to the other Ethernet device; (4): for TCP packets only a pointer is passed to the retransmission buffer; (5): retransmissions are sent via a RAW Socket)

In Figure 3.3 you can see that all traffic takes the way (1), (2) from kernel space to user space directly into the buffer of the proxy. TCP and non-TCP traffic is directly sent out on way (3) via a RAW Socket to the other Ethernet device, to provide bridging and therefore transparency of the proxy.

If the proxy classifies the content of the Ethernet frame as a TCP packet, a pointer to the current part of the buffer is forwarded on way (4) to the retransmission buffer of the proxy, and further processing like TCP connection tracking is triggered. For every other type of data traffic, the part of the buffer, later (in 3.4.2) defined as a chunk, is freed again. If a retransmission of a buffered TCP packet is needed later, the proxy sends the packet from the retransmission buffer on way (5) via a RAW socket to the Ethernet device.

The main difference from the libPCap design is the optimized buffer management. Instead of copying all traffic to user space and copying TCP packets a second time from the libPCap buffer to the retransmission buffer, only a pointer is passed to the other modules of the proxy for further processing and perhaps a retransmission. The buffer chunk is reused for non-TCP packets and kept for TCP packets, which is also a small performance enhancement. Compared to the libPCap design, exactly one copy operation in user space is saved, but the fundamental problem stays: all the traffic has to be copied to user space and filtered, classified or modified there.
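On Linux, the RAW Socket capture described above can be realized with an AF_PACKET socket bound to one device. The following sketch is an illustration under assumed names and sizes, not the thesis code; promiscuous mode still has to be enabled separately, for example with the PACKET_ADD_MEMBERSHIP socket option.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a packet socket that delivers every Ethernet frame seen on the
 * given interface (ETH_P_ALL) to user space. */
static int open_capture_socket(const char *ifname)
{
    struct sockaddr_ll addr;
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex  = if_nametoindex(ifname);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    unsigned char frame[2048];
    int fd = open_capture_socket("eth0");
    if (fd < 0)
        return 1;

    ssize_t n = recv(fd, frame, sizeof(frame), 0);  /* one Ethernet frame */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}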


3.3.3 Netfilter Queue

Before the advantages of the Netfilter Queue design can be shown, some information on the Linux Kernel Ethernet Bridge has to be given. This is important because the previous designs had to implement its functionality by forwarding the traffic from one device to another with RAW sockets.

3.3.4 Kernel Ethernet Bridge

Ethernet bridging aggregates two or more Ethernet devices into one logical device. Bridged devices are automatically put into "promiscuous mode" by the kernel. As we already know, this mode of operation tells the Ethernet device to forward all traffic to the device driver, not only the traffic which is destined to its own MAC address. All traffic which is not destined to the host of the proxy is forwarded to the device with the corresponding part of the Ethernet subnet. The bridge learns where to forward an Ethernet frame from the Source MAC Address field (see Figure 3.4) in the Ethernet header.

Figure 3.4: Ethernet Type II Frame format (MAC header with destination MAC address, source MAC address and EtherType, 14 bytes; payload/data, 46 to 1500 bytes; CRC checksum, 4 bytes; total frame size 64 to 1518 bytes)

All unknown MAC addresses from the Source MAC Address field are stored in a hash table together with some additional information like a timestamp and the source device. This builds a database of MAC addresses linked with the Ethernet device over which the host can be reached. The Ethernet bridge needs to process two cases:

If a MAC address is unknown, the Ethernet frame is forwarded to all other bridged devices. This will normally produce a response from the destination host. With such a response, the bridge is also able to detect the corresponding Ethernet device for the previously unknown destination host. In the response, source and destination are exchanged compared to the first Ethernet frame. The previous destination MAC address can now be learned from the source MAC address of the response. Using this logic, broadcasting to all other devices is needed only once.


If a MAC address is already stored in the hash table, the bridge knows where to forward the Ethernet frame.

3.3.4.1 Netfilter

The Netfilter Queue interface 6 is a part of the Netfilter framework. It is a user space library providing an API to packets that have been queued by the kernel packet filter. On the transport or network layer it can issue verdicts and/or reinject altered packets back into kernel space.

Using the well-known iptables 7 tool from the Netfilter framework, special filter rules can be set. These filter rules decide which packets are passed to user space. The possible ruleset of Netfilter includes every standard type of IPv4 or IPv6 traffic. For the proxy we only need to install one rule, which matches all TCP traffic, since TCP is the only relevant data for the proxy. Protocol classification is already applied in kernel space by the Netfilter framework. Traffic other than TCP stays in kernel space and is not transferred to the proxy in user space, so the memory reservation and the copy operation to user space for non-TCP traffic are saved. The task of pulling network traffic from one Ethernet device to the other, which makes the proxy transparent, does not have to be handled by the proxy in this design. Due to the fact that filtering and protocol classification can be applied in kernel space by the Netfilter framework, but controlled from user space, it is possible to use the built-in Kernel Ethernet Bridge which was described at the beginning of this section.

Netfilter makes no distinction between a physical network device and a logical bridge device. Rules can be set and applied to both of them. The two needed Ethernet devices are aggregated by the Kernel Ethernet Bridge into one bridge device. For this reason it is also enough to have only one thread for capturing, because there is only one device to capture from. This makes most of the thread synchronization in the whole implementation a lot easier and gains a bit more performance, because it saves some overhead. Information about which is the incoming and which is the outgoing physical device of a packet is also provided by Netfilter Queue. This information is needed for retransmission of the packet on the appropriate device.

6 http://www.netfilter.org/projects/libnetfilter_queue/index.html

7 http://www.netfilter.org/projects/iptables/index.html


Figure 3.5: Dataflow with Netfilter Queue ((1): the Kernel Ethernet Bridge hands the frame to the logical bridge device; (2): the frame is passed to the Netfilter framework; (3): Netfilter grants the forwarding to the other device; (4): TCP packets are copied to the proxy in user space via Netfilter Queue; (5): a retransmission, if needed, is sent via a RAW Socket to the appropriate Ethernet device)

For all traffic, the bridge decides whether it needs to be forwarded to the other Ethernet device. If the decision is positive, the frame takes way (1) in Figure 3.5 and is handed over to the logical Ethernet device which represents the bridge. From here the frame is passed (2) to the Netfilter framework. Netfilter issues verdicts based on the installed filter rules and grants (3) the forwarding to the other device. In the case of TCP traffic, the packet is copied (4) to the proxy in user space. Netfilter asks the user space application, the proxy, whether the packet should be forwarded or modified; normally the answer is yes for forwarding. Retransmission, if needed, is done with a RAW Socket (5) on the corresponding Ethernet device. This design saves a few more copy operations between kernel and user space than the other designs.
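A minimal sketch of driving libnetfilter_queue from user space is shown below; the queue number and the iptables rule in the comment are assumptions, and error handling is omitted, so this is an illustration of the interface rather than the thesis implementation.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Verdict callback: every queued TCP packet arrives here in user space.
 * A proxy would copy it into its retransmission buffer before accepting;
 * NF_ACCEPT reinjects the (possibly modified) packet into the kernel. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    printf("queued packet id=%u, %d bytes\n", id, len);
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    /* Assumes a rule such as
     *   iptables -A FORWARD -p tcp -j NFQUEUE --queue-num 0
     * so that only TCP traffic crossing the bridge reaches queue 0. */
    struct nfq_handle   *h  = nfq_open();
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    char buf[4096];
    int n;

    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);    /* copy full packets */
    int fd = nfq_fd(h);
    while ((n = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, n);               /* dispatches to cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}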

3.3.5 Conclusion of the Comparison

During the development of the proxy all three methods for capturing were implemented, in the order they were presented here. It is a bit like an evolutionary design with the goals of fixing issues and shortening/enhancing the way from the wire into the proxy.

The capturing methods libPCap (3.3.1) and RAW Sockets (3.3.2) are very similar, except for the more effective buffer management of the RAW Sockets design.


Technically, both designs use RAW Sockets for capturing, but libPCap has an abstraction layer above them to support different platforms and to make the handling easier. The libPCap design lags behind in that it needs a second buffer in user space for the TCP traffic. Copying the incoming TCP traffic to this second buffer is saved in the plain RAW Socket design, because the traffic is captured directly into the retransmission buffer of the proxy. Both designs do the protocol classification of the traffic completely in user space, therefore all the traffic has to go the way down from kernel to user space and back to kernel space. It goes back to kernel space because the traffic is forwarded to the other Ethernet device to implement bridging. Only a copy of the TCP traffic stays in user space for retransmission; all other traffic is dropped after it was forwarded.

The Netfilter design improves on this by doing the classification with the help of the kernel in kernel space. Only TCP traffic has to go the way down to user space and is buffered there, and only in the case of retransmission or modification does it have to go the way back up to kernel space. It also removes some complexity, because there is only one capture thread left for exactly one Ethernet device, the logical bridge device. Overhead for thread safety and synchronization is also saved.

The following Table 3.1 is a summary of the needed copy and memory operations. The k stands for kernel space and u for user space; "buffer[u]" denotes the retransmission buffer of the proxy in user space.

Design            Description                                        Copy operations   Memory allocations
libPCap           k -> u -> k  or  k -> u -> buffer[u] -> k          2 or 3            3 or 4
RAW Sockets       k -> buffer[u] -> k                                2                 3
Netfilter Queue   k -(pointer)-> k  or
                  (k -(pointer)-> k) and (k -> buffer[u])            0 or 1            1 or 2

Table 3.1: Comparison by counting copy and memory operations

Finally, it can be concluded that the Netfilter Queue design is the best choice with respect to performance. It saves many data transfers to user space, and this is very important for higher bit rates. The decision is based on reading the source code of the Linux kernel, the Netfilter framework and libPCap, and on implementing each design during the development of the proxy.


3.4 Module Design of the Proxy Implementation

3.4.1 Operating System requirements

For actually setting up the proxy, the Linux kernel needs to have "802.1d Ethernet Bridging" support enabled, and the Netfilter framework for IPv4 must be enabled. The Kernel Ethernet Bridge has to be set up and the filter rule which applies to TCP traffic has to be installed before any packet can be captured.

To control the "802.1d Ethernet Bridging" extension in the kernel, the "bridge utilities" 8 are needed. With the following shell commands the needed bridge (br0) starts to forward Ethernet frames from eth0 to eth1 and vice versa.

brctl addbr br0         # create bridge
brctl addif br0 eth0    # add eth0 to bridge
brctl addif br0 eth1    # add eth1 to bridge
ifconfig br0 up         # start the device

At this point the traffic can pass the proxy host fully transparently, but without any filtering or changing of the traffic. The user space proxy application must be started with the following command.

# <bridge device> <device1> <device2>
./tcp_proxy br0 eth0 eth1

Any further initialization is done during the startup of the application, which is described in the following sections for each module separately.

3.4.2 Module: Buffer Manager

The proxy implementation has to buffer every TCP packet until it can be assumed that it has reached its destination host. Therefore memory in user space has to be allocated. Normally this is done with a void *malloc(size_t size) system call, which returns a pointer to the allocated memory. After a packet has reached its destination, the memory could be deallocated with the void free(void *s) system call.

8 http://www.linux-foundation.org/en/Net:Bridge


But each system call gives control back to the operating system during the runtime of the application until the system call returns. This includes a context switch from the user space application to the kernel. Each context switch operation takes time, because registers and states are saved. If allocation and deallocation is done for each and every TCP packet separately, this could lead to a performance problem because of the high number of context switches. There would be two context switches for each buffered packet.

The Buffer Manager avoids this issue by allocating one large memory block at initialization time. The large block is divided into small pieces, each of which is later used to store a TCP packet. In addition, a data structure specially designed to be exchanged between the different modules of the implementation is used. Such a data structure is called a chunk from now on.

A chunk carries all important information about a packet: the packet itself, as a pointer to the right piece of the large buffer memory block, plus some additional management information. Normally a memory block is retrieved from the operating system and returned once the data stored in it is no longer needed. But if all packets are stored in one large memory block, this block cannot be returned to the operating system during runtime, because all the data would be lost, not only the data of the one packet that is no longer needed. Therefore the memory block is only returned when the application terminates.

The Buffer Manager has to keep track of which chunks are currently in use. A linked list which stores pointers to all currently unused chunks is used and is accessed like a FIFO (First In, First Out) list.

Figure 3.6: Logical structure: Linked list used as a FIFO

During initialization of the Buffer Manager all chunks are added to the FIFO list.

If another module needs a free chunk to store data, it asks the Buffer Manager. As shown in Figure 3.6, the Buffer Manager always removes the first element from the list and hands the pointer to that chunk to the requesting module. A returned chunk is appended at the end of the list.

29

3

Proxy Implementation Design

For retrieving a free chunk from the list, the implementation simply follows and backs up the FIRST pointer (see Figure 3.7 for the pointer names) to "Chunk 00". At the end of the retrieve operation, the backed-up pointer to "Chunk 00" is passed to the requesting module. "Chunk 01" becomes the new first element of the list; therefore the value of the NEXT pointer from "Chunk 00" to "Chunk 01" is copied into the FIRST pointer.


Figure 3.7: Implementation: Linked list at initial state

After the retrieval of "Chunk 00" the linked list looks like Figure 3.8 and a pointer to "Chunk 00" can be passed to the requesting module.


Figure 3.8: Implementation: Linked list after retrieval of a chunk

For returning "Chunk 00" back to the list, the implementation simply follows the LAST pointer to "Chunk n". The NEXT pointer of "Chunk n" is adjusted to point to "Chunk 00". Finally, after adjusting the LAST pointer to "Chunk 00", the return operation is finished. Figure 3.9 shows the order of list elements and the pointer adjustments after retrieving and returning "Chunk 00".

As a small summary: simply by using only the FIRST pointer for retrieving an element and only the LAST pointer for returning an element, no mutex is needed to protect this list against asynchronous manipulation. This prevents blocking another thread if two or more threads want to retrieve a chunk at the same time. Having a LAST pointer also speeds up returning a chunk considerably; without it, the whole list would have to be iterated to find the last element before appending, whereas with the LAST pointer the element can be appended directly.



Figure 3.9: Implementation: Linked list after returning the chunk

Simultaneous retrieving and returning is possible without using mutexes for thread synchronization, as long as there is more than one element in the list. This is an improvement if some processing of the chunks happens in parallel to the capturing of packets, which is the case for this proxy implementation.
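The following is a minimal sketch of such a chunk FIFO in C. The names (struct chunk_fifo, fifo_retrieve, fifo_return) are chosen here for illustration and are not the actual identifiers of the implementation; the real code keeps the chunks inside the large pre-allocated block and performs more bookkeeping.

/* Illustrative sketch of the unused-chunk FIFO described above.
 * One thread retrieves via FIRST only, another returns via LAST only,
 * so no mutex is needed as long as more than one element remains. */
struct chunk {
    struct chunk *next;     /* NEXT pointer of the linked list */
    char         *buffer;   /* piece of the large memory block */
};

struct chunk_fifo {
    struct chunk *first;    /* FIRST pointer, touched only when retrieving */
    struct chunk *last;     /* LAST pointer, touched only when returning */
};

/* Retrieve a free chunk: hand out the first element and advance FIRST. */
static struct chunk *fifo_retrieve(struct chunk_fifo *f)
{
    struct chunk *c = f->first;

    if (c == NULL || c->next == NULL)
        return NULL;                 /* keep at least one element in the list */
    f->first = c->next;
    return c;
}

/* Return a chunk: append it behind LAST, no iteration needed. */
static void fifo_return(struct chunk_fifo *f, struct chunk *c)
{
    c->next = NULL;
    f->last->next = c;
    f->last = c;
}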

3.4.3 Module: Netfilter Queue Interface

Capturing the TCP traffic is done in this module. For this purpose a callback is registered with the libnetfilter_queue library. The Linux kernel keeps a linked list of such callbacks, and with the libnetfilter_queue library it is possible to create an entry in this list. Whenever a packet matches a Netfilter filter rule with "QUEUE" as its action, these callbacks are called one after another in registration order.

The TCP traffic filter rule is installed during the initialization of the Netfilter Queue Interface module. This happens after the registration of the callback. If the proxy application is terminated for any reason, the rule is automatically deleted again. This is very important, because if a filter rule with "QUEUE" as action is installed and matches a packet, the Netfilter Framework asks a waiting application in user space whether the packet should be dropped. If no application is present to tell the framework that a packet has to be accepted, the packet is dropped.

The following box shows the rule which is installed by the proxy.

# command to install the TCP filter rule
iptables -A FORWARD -p tcp -j QUEUE

With this rule all forwarded ("-A FORWARD") TCP traffic ("-p tcp") is handled by the Netfilter Queue ("-j QUEUE"). The packets of interest are in the "FORWARD" chain of Netfilter, because the Kernel Ethernet Bridge is used.


The libnetfilter_queue library supports three copy modes to transport data from kernel space to user space.

NFQNL_COPY_NONE - Do not copy any data

NFQNL_COPY_META - Copy only packet meta data

NFQNL_COPY_PACKET - Copy entire packet

NFQNL_COPY_META would be enough to realize connection tracking in user space and would greatly reduce the amount of data that has to be copied to user space: only the IP and TCP headers are copied in this mode, which suffices to identify and track a TCP flow. For the proxy implementation, however, only the NFQNL_COPY_PACKET mode is the right choice. In this mode the whole packet is transferred to user space and can be buffered there by the proxy for retransmission.

To capture a single packet a blocking function of the libnetfilter_queue library must be called. The process of capturing is defined in eight steps:

1. Fetch a free buffer chunk from the Buffer Manager

2. Request the next packet with libnetfilter_queue (blocking)

3. Registered callback is triggered

4. Packet is passed to the TCP Connection Tracker

5. Packet is passed to the Connection Manager

6. Return to callback

7. Return from “Request the next packet”

8. Go to step 1.
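The following is a condensed sketch of such a capture loop with libnetfilter_queue, assuming queue number 0 and omitting error handling, the iptables rule installation and the proxy's chunk handling; exact function signatures may differ slightly between library versions.

#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called for every packet that matched the "-j QUEUE" rule. */
static int capture_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                      struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    unsigned char *payload;
    uint32_t id = ntohl(ph->packet_id);             /* Netfilter Queue ID */

    (void)nfmsg; (void)data;
    if (nfq_get_payload(nfa, &payload) >= 0) {
        /* here: copy the packet into a chunk, set the ip/tcp header
         * pointers and hand the chunk to the TCP Connection Tracker */
    }
    /* forward the (unmodified) packet */
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    char buf[65536];
    struct nfq_handle   *h  = nfq_open();
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &capture_cb, NULL);
    int rv;

    /* copy the whole packet, not only the meta data */
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

    for (;;) {                                      /* blocking capture loop */
        rv = recv(nfq_fd(h), buf, sizeof(buf), 0);
        if (rv >= 0)
            nfq_handle_packet(h, buf, rv);          /* triggers capture_cb() */
    }
}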

Step three not only retrieves the TCP packet from kernel space and stores it in a buffer; its length and other information needed for further processing is gathered and stored as well. Pointers to the IP and TCP headers are calculated and stored in the chunk data structure as ip_header and tcp_header, which is shown in the following box "Chunk data structure".


Chunk data structure

struct buf_man_cap_buf_chunc
{
    struct list_head     list;         /* linked list management */
    char                *buffer;       /* pointer to byte array with IP packet */
    unsigned int         length;       /* length of whole packet */
    struct device_info   out_device;   /* output device */
    struct ip           *ip_header;    /* pointer to IP header */
    struct tcphdr       *tcp_header;   /* pointer to TCP header */
    unsigned int         tcp_seq;      /* TCP sequence number */
    int                  nfq_id;       /* Netfilter Queue ID */
    enum chunc_state     state;
    struct timeval       timestamp;    /* general timestamp, set at receive AND sent
                                          (tv_sec = seconds, tv_usec = microseconds) */
};

This is done in preparation especially for the TCP Connection Tracker module, but also for all other modules that need direct access to these headers. The Netfilter Queue Interface module is the first module in the processing chain of the proxy implementation, therefore it makes sense to set these pointers here.

Additionally the timestamp is set to the current time, i.e. the capture time with a resolution of microseconds. This time is needed to calculate the Round Trip Time (RTT) of a packet; the RTT itself is calculated later in the Connection Manager.

To make retransmission possible, the outgoing physical device of a packet must be known. It is retrieved from Netfilter and stored in out_device of the chunk structure. Referencing a packet while using the Netfilter Queue interface is done with a Netfilter ID. Later, in the TCP Connection Tracker module or in the Connection Manager module, it is decided whether a packet should be dropped, forwarded or modified, and for that decision the Netfilter ID is needed to reference a specific packet. This is achieved by storing the Netfilter ID in nfq_id and by providing three functions for more convenient handling of packets. Each function takes a pointer to a chunk, which carries the packet, as parameter. The


Netfilter ID is automatically taken from the chunk data structure by these functions.

They are called:

int netfilter_signal_accept(struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_accept_but_modified(struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_drop(struct buf_man_cap_buf_chunc *chunc);

The names are largely self-explanatory, but a few words about them: netfilter_signal_accept and netfilter_signal_drop simply signal Netfilter to forward or drop the packet. netfilter_signal_accept_but_modified expects the modified packet in the chunk and passes it to kernel space, from where it is then forwarded.

When a packet arrives, the values of source port, destination port, SEQ number and ACK number are stored in Network Byte Order. The sequence number is looked up by several modules of the implementation, but for comparing it to another value it has to be in Host Byte Order. On x86 hosts the Host Byte Order differs from the Network Byte Order: the byte order is reversed. Reversing the byte order every time the value is compared is wasteful if this happens more than once. To improve this a bit, the sequence number is converted once by the Netfilter Queue Interface module and stored in tcp_seq of the chunk data structure. After all important information has been gathered and stored in the chunk data structure, the chunk is handed over to the TCP Connection Tracker module.
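As a small illustration of this conversion (the field is called seq or th_seq depending on which style of struct tcphdr the system headers expose; the snippet below assumes the Linux-style naming):

#include <arpa/inet.h>      /* ntohl() */
#include <netinet/tcp.h>    /* struct tcphdr */

/* Convert the sequence number to Host Byte Order once, so that all later
 * comparisons can work directly on host-order integers. */
static unsigned int host_order_seq(const struct tcphdr *tcp_header)
{
    return ntohl(tcp_header->seq);
}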

3.4.4 Module: TCP Connection Tracker

The TCP Connection Tracker module identifies a TCP flow and looks up the corresponding management data structure. If the flow is unknown to the proxy, a new management data structure is created and initialized. Tracking of TCP flows is stateful, which means a state is maintained per flow and only transitions conforming to the TCP standard are allowed. Packets that violate a stateful transition are ignored. Only new flows are picked up for tracking, during their connection establishment. Picking up an already established flow would be possible with the implemented state machine, but is not yet implemented by the rest of the proxy.


After identification of a flow, the newly arrived packet is added to a per-flow cache. The packet and the corresponding management data structure are handed over to the Connection Manager module. Hashing is used for faster lookup and identification of a flow. The implementation of the primary hash function was taken from the Linux kernel. It is called jhash 9 and was developed by Bob Jenkins; it is fast and mixes its input well. It was also published in the well-known Dr. Dobb's 10 computer magazine.

During initialization of the module, an array which is used as hash table is created.

Elements of the hash table are called buckets; each bucket is a linked list used for storing pointers to the management structures of TCP flows.

A primary and a secondary hash function are used to distribute data in the hash table.

The primary hash function is jhash and the secondary function is a modulo operation with the size of the hash table as divisor. A prime number is chosen for the size of the hash table. Using a prime number for the hash table size together with a modulo operation as secondary hash function is a good idea, because it minimizes clustering in the hash table (see 3.4.4.1).

The TCP State Machine is imported from the Netfilter Framework and described further in section 3.4.6.1. Netfilter implements a TCP state machine to provide stateful packet inspection for the Linux kernel packet filter. The reason for taking or adapting source code from other implementations is that these implementations have proven their value and many eyes have reviewed the code. It is not wise to reinvent the wheel each time and probably repeat the same mistakes other people have already made in the past.

3.4.4.1 Prime Numbers for the Secondary Hash Function

Normally we tend to use a value of 2^n as the size of an array, because we like and know these numbers: a programmer has a good feeling for such values and can compare them with the sizes of memory or hard disks of a PC. Let us call the size of the table S and the result of the primary hash function H. The secondary hash function is then (H mod S). What makes (H mod S) a hash function that distributes well?

Let the size S be divisible by 2; note that 2^n matches this specification. Then whenever H is divisible by 2, (H mod S) will also be divisible by 2, and whenever H is not divisible by 2, (H mod S) will not be either.

9 http://burtleburtle.net/bob/hash/

10 http://www.ddj.com/

35

3

Proxy Implementation Design

This means that, by applying the secondary hash function, even numbers hash to even indices and odd numbers hash to odd indices.

If S were also divisible by 3, then multiples of 3 would hash to multiples of 3 and non-multiples of 3 would hash to non-multiples of 3. We would expect half the numbers to be even and the other half to be odd, but unfortunately this is unlikely, because a sample set tends to be biased, and the smaller it is, the more so. As a result the secondary hash perpetuates this bias instead of reducing it by mixing the values. For example, with S = 8 the even hash values 4, 10, 16 and 22 map to the indices 4, 2, 0 and 6, all of them even, so half of the buckets stay empty; with the prime S = 7 the same values map to 4, 3, 2 and 1.

Therefore it is in general better to use a prime number as the size of a hash table, because a prime has no divisors other than 1 and itself.

3.4.4.2 Identify a TCP flow

A flow or connection is identified by a tuple of the source IP address (srcIP), the destination IP address (dstIP) and, in particular, the source port (srcPort) and destination port (dstPort) numbers:

flow tuple = (srcIP, dstIP, srcPort, dstPort)

For IP version 4 (IPv4), srcIP and dstIP are 32-bit values; srcPort and dstPort are 16-bit values.

Sample for an IP address:

IP address (4 bytes)   binary as octets                       unsigned 32 bit integer
130.94.122.195         10000010 01011110 01111010 11000011    2187229891

Normal PCs mostly work in x86 mode, which implies binary compatibility with the 32-bit instruction set: each machine instruction can take a 32-bit operand. Working with 32-bit operands is therefore the most effective use of this hardware, so the proxy implementation treats IP addresses and TCP ports as unsigned 32-bit values. This makes, for example, comparisons more efficient than comparing four bytes separately (as in the dotted notation).

The proxy sees packets of one flow in both directions: from the original sender to the original receiver, and the responses. For the response direction, srcIP and dstIP are swapped, and so are srcPort and dstPort.


This is a problem, because all packets of the original direction would produce a different hash value than those of the reply direction. Two hash values would mean searching two hash buckets for the corresponding management data structure. Maintaining two management structures, one for each direction of a flow, would also make it more difficult to keep track of which packets are already acknowledged and can be deleted from the proxy buffer.

Feeding the values into the jhash function in a special way solves this issue. The jhash function, which is used as the primary hash function, is designed to take one, two or three unsigned 32-bit values as arguments; for the proxy implementation the three-argument version is chosen. srcIP and dstIP are the first two arguments, 32 bits each, and the one of srcIP and dstIP that has the higher value as an unsigned 32-bit integer is always passed as the first argument. The third 32-bit value is calculated by adding the two 16-bit values srcPort and dstPort. Adding two 16-bit values yields at most a 17-bit result, which easily fits into the 32-bit third argument of jhash.

Applying this order to the arguments is the first half of the trick, because it always produces the same input arguments for jhash and therefore the same output for packets of both directions. The second half is summing the two TCP ports: since addition is commutative, adding the two 16-bit values srcPort and dstPort always produces the same result, regardless of the direction of the packet.

Taking only srcIP and dstIP as input for the primary hash function would be sufficient. Including srcPort and dstPort as well gives a better distribution over the hash table, especially if one IP host has many simultaneous connections on different TCP ports, and a better distribution results in shorter lookup times. The following sample C code fragment shows how easily the primary and secondary hash function can be implemented. SIZE represents the hash table size and HASH_RND a constant which jhash requires for better mixing.

port_val = srcPort + dstPort;

if ((unsigned int) srcIP > (unsigned int) dstIP)
{
    return jhash_3words(srcIP, dstIP, port_val, HASH_RND) % SIZE;
}
else
{
    return jhash_3words(dstIP, srcIP, port_val, HASH_RND) % SIZE;
}


The TCP Connection Tracker module mainly executes the following steps:

1. Get chunk from Netfilter Queue Interface module

2. Generate hash value from packet headers (IP and TCP, as described above)

3. Search hash bucket identified by hash value

4. Identify (or create a new) management data structure in the hash bucket

5. Add chunk to the per-flow retransmission buffer

6. Pass chunk and management data structure to Connection Manager

7. Go to 1.

All steps up to step four are covered by the previous parts of this section. Step four has two different behaviors.

For known TCP flows the module proceeds directly to step five. Packets from unknown connections are only accepted if the packet is the first packet of a 3-way connection handshake (see Figure 3.10); new management data structures are created and initialized only for these packets. In the figure this packet is marked by the arrow from "Client" to "Server" labeled "syn seq=x". Ignoring other packets is not an implementation fault but a design decision, because accepting them would mean picking up an already established flow and not only new ones, in which case some information could only be guessed. An example is which TCP options are supported by the communicating hosts. This information is very important for TCP Snoop, which acts differently if the TCP SACK option is not supported by the hosts. The implementation and the implemented TCP state machine are prepared for picking up established connections; this functionality could be added in future work but is not necessary for normal operation.


Figure 3.10: TCP Handshake

A management data structure carries one sub data structure for each direction of a flow, i.e. two per flow, with the following information.

38

3

Proxy Implementation Design

The sub data structure carries the following information:

Count of retransmitted packets (for statistics)

Count of duplicate ACKs (to detect lost packets)

First sequence number seen for this direction (to detect wrap arounds)

Last sequence number seen for this direction (to detect duplicate packets)

Last acknowledge number seen for this direction (to detect new ACKs)

Last “end” (Last sequence number + length of TCP data)

Last TCP window advertisement seen for this direction

RTT (for retransmission timers)

TCP Window scale factor

Supported TCP options

These sub data structures must be initialized for each new connection. A sketch of such a per-direction structure is shown below.
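The field names in the following sketch are chosen for illustration and do not necessarily match the identifiers used in the actual source code.

#include <sys/time.h>

/* Per-direction tracking data; two of these make up the management
 * structure of one flow.  Names are illustrative. */
struct flow_direction {
    unsigned int   retransmit_count;   /* retransmitted packets (statistics) */
    unsigned int   dup_ack_count;      /* duplicate ACKs (lost packet detection) */
    unsigned int   first_seq;          /* first SEQ seen (wrap-around detection) */
    unsigned int   last_seq;           /* last SEQ seen (duplicate detection) */
    unsigned int   last_ack;           /* last ACK seen (new ACK detection) */
    unsigned int   last_end;           /* last SEQ + length of TCP data */
    unsigned short last_window;        /* last advertised TCP window */
    unsigned char  window_scale;       /* TCP window scale factor */
    unsigned char  sack_ok;            /* supported TCP options (here: SACK) */
    struct timeval rtt;                /* RTT (for retransmission timers) */
};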

After that, or if the connection is already known, the TCP Connection Tracker continues with step five (see page 38), which is described in the next section 3.4.4.3. Step six forwards the chunk and the management data structure to the Connection Manager. The Connection Manager completes the stateful tracking with the raw information gathered by the TCP Connection Tracker; its processing is described in section 3.4.6.

3.4.4.3 Retransmission Buffer

Most of the buffered packets are never retransmitted; only a few are. This calls for a quick lookup and release from the buffer once a packet has been acknowledged by the receiver. As already described, packets are only exchanged between functions and modules of the implementation by passing pointers to chunks; the chunks encapsulate the pointer to the actual packet together with some additional information about it. An intuitive approach to look up and free all acknowledged packets would be to iterate a list


with all currently buffered packets. But looking up packets of other connections is very inefficient, therefore the proxy implements a per-flow cache. It consists of two linked lists that store pointers to chunks, ordered ascending by the TCP sequence number of the packets, one list for each communication direction. Adding a packet to the per-connection cache takes nearly the same effort as adding it to a global list of all currently buffered packets: the communication direction and the corresponding management structure are already known at step six (see page 38), so it is just the effort of adding an element to a linked list, except for keeping the list ordered. The ordinary case, however, is that packets carry rising sequence numbers, as defined by the TCP standard, so they are usually appended at the end of the already sorted list. Only delayed packets have to be inserted at the correct position and cause some extra work. The ascending order makes freeing acknowledged packets easy: the corresponding list is iterated and packets are freed from the beginning until a packet with a sequence number greater than or equal to the acknowledgment number is reached. In summary, packets are grouped by flow, additionally by direction, and ordered by sequence number. This allows an efficient lookup, which leads to good performance.
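A simplified sketch of releasing acknowledged chunks from such a SEQ-ordered list is shown below; buffer_manager_return() stands in for handing the chunk back to the Buffer Manager, and sequence number wrap-around is ignored for brevity.

struct buffered_chunk {
    struct buffered_chunk *next;
    unsigned int           tcp_seq;    /* sequence number in host byte order */
};

/* Provided by the Buffer Manager (illustrative name). */
extern void buffer_manager_return(struct buffered_chunk *c);

/* Free all chunks acknowledged by "ack" and return the new list head.
 * A real implementation must compare sequence numbers modulo 2^32.   */
static struct buffered_chunk *
release_acknowledged(struct buffered_chunk *head, unsigned int ack)
{
    while (head != NULL && head->tcp_seq < ack) {
        struct buffered_chunk *done = head;
        head = head->next;
        buffer_manager_return(done);
    }
    return head;
}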

3.4.5 Module: Timer Manager

The Timer Manager is mainly a helper module for the Connection Manager. During initialization of the module it installs an operating system callback: the operating system sends a timer signal at a given interval, which is caught by the signal handler of the proxy application and forwarded to the timer callback.

Intervals, types, count and organization of the timers are mainly based on ideas from the BSD TCP/IP stack [Stevens, chapter 25] and a TCP/IP stack for embedded systems. The BSD TCP/IP stack was used as a reference; it is very similar to the Linux TCP/IP stack and a good source of information about TCP timers. As in the BSD stack there is a fast and a slow timer. The fast one is triggered every 200 milliseconds and the slow one every 400 milliseconds; compared to the BSD stack, which uses an interval of 500 milliseconds for the slow timer, only this value differs. The reason for using 400 milliseconds in this implementation is a more efficient way to implement the timer handling in user space. An application can install only one timer signal, which would mean having only one interval. To solve this, the proxy installs a 200 millisecond timer and creates the 400 millisecond timer by doubling it: an alternating function which produces a 0, 1, 0, 1, ... sequence is used, and for every timer signal the output value is checked for being zero or non-zero.

40

3

Proxy Implementation Design

This function is very simple and defined as follows:

value = value XOR 1

The check for zero and the XOR can be translated by a compiler into very few machine instructions, which makes this very efficient: one XOR instruction and one conditional jump instruction for the value check. Efficiency matters here, because the timer handler runs every 200 milliseconds.
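The mechanism could be sketched as follows; run_fast_timer() and run_slow_timer() are placeholders for the actual timer processing, and the use of signal()/setitimer() is an assumption about how the single timer signal is installed.

#include <signal.h>
#include <sys/time.h>

extern void run_fast_timer(void);   /* 200 ms work (placeholder) */
extern void run_slow_timer(void);   /* 400 ms work (placeholder) */

static unsigned int toggle;         /* alternates 0, 1, 0, 1, ... */

/* Signal handler for the single 200 ms interval timer.  Every second
 * invocation also runs the slow (400 ms) timer work. */
static void timer_handler(int signo)
{
    (void)signo;
    run_fast_timer();
    toggle ^= 1;                    /* value = value XOR 1 */
    if (toggle == 0)
        run_slow_timer();
}

static void install_timer(void)
{
    struct itimerval iv = {
        .it_interval = { 0, 200000 },   /* 200 ms */
        .it_value    = { 0, 200000 },
    };

    signal(SIGALRM, timer_handler);
    setitimer(ITIMER_REAL, &iv, NULL);
}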

The Timer Manager offers a simple API to the other modules for creating a timer event. There are predefined timer actions from which the other modules can choose. Currently implemented actions are:

Retransmission, which retransmits a TCP packet.

Timewait, which is used during a special state of the implemented TCP state machine.

Timeout, which is used to detect timed out TCP flows (no data sent any more).

By utilizing a general purpose pointer in the data structure that stores the information about a timer event, all actions can use the same data structure. This makes it simple to extend the Timer Manager with more actions and makes the management of the different timer event types more efficient. Casting from the general purpose pointer to the actual data type of the current action is only done when the action has to be triggered, which simplifies and shortens the loop that checks whether an event action should fire.

The timer events are stored in an ordered linked list, one list for each of the two timers. Events are ordered by their timestamp, which defines when the timer action should be triggered. Keeping the events ordered raises the effort to create a new entry in the list, because the right place has to be found. But under the assumption that mostly retransmission timer events are created, the position is in most cases at or very near the end of the list. The main reason for this is the sequential processing of the TCP packets and the timeout calculation: the timeout is defined as the current time plus a multiple of the round trip time of a TCP packet, so the timeout values grow sequentially. Insertion into the list is done by starting at the end and iterating backwards until the correct position is found; under the previous assumption, iterating from the end is efficient. Searching for events that have to be triggered is done by iterating forward from the start of the list until the current time is larger than the timestamp of the currently checked event. This is also very efficient, because only the events that have to be triggered plus one are visited; future events do not have to be checked. A sketch of such an event structure and the backwards insertion follows.
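Names and list handling in this sketch are illustrative; the real implementation may differ in detail.

#include <sys/time.h>

enum timer_action { ACTION_RETRANSMIT, ACTION_TIMEWAIT, ACTION_TIMEOUT };

/* One timer event.  "payload" is the general purpose pointer which is
 * cast to the action-specific type only when the event fires. */
struct timer_event {
    struct timer_event *prev, *next;
    struct timeval      fire_at;      /* when the action should be triggered */
    enum timer_action   action;
    void               *payload;      /* e.g. a chunk for ACTION_RETRANSMIT */
};

struct timer_event_list {
    struct timer_event *head, *tail;
};

/* Insert a new event, searching backwards from the tail, because new
 * (retransmission) timeouts usually belong at or near the end. */
static void timer_event_insert(struct timer_event_list *l, struct timer_event *ev)
{
    struct timer_event *pos = l->tail;

    while (pos != NULL && timercmp(&pos->fire_at, &ev->fire_at, >))
        pos = pos->prev;

    ev->prev = pos;
    ev->next = pos ? pos->next : l->head;
    if (ev->next) ev->next->prev = ev; else l->tail = ev;
    if (pos)      pos->next      = ev; else l->head = ev;
}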


3.4.6 Module: Connection Manager

After the TCP Connection Tracker module has looked up the corresponding management structure for a packet, both are passed to this module, the Connection Manager. Its functionality can be described as a lightweight TCP stack, or as the TCP connection tracking part of a stateful firewall. As a reference for a standard TCP stack and how it works, the BSD implementation was used; it is described in the book "TCP/IP Illustrated, Volume 2: The Implementation" [Stevens]. Ideas for connection tracking are also taken from the Netfilter implementation and from a paper by Guido van Rooij [Rooij]. The following text describes connection tracking for TCP Snoop, not general purpose connection tracking.

The main tasks of this module are updating the state of a flow, calculating the RTT if possible, installing the retransmission timers, detecting gaps in the sequence numbers and detecting duplicate ACKs. During the development of the TCP proxy the question arose whether to keep a management structure for each communication direction or only one for the whole TCP flow. From the point of view of the TCP Connection Tracker module it makes sense to have two management structures, because that simplifies the hashing: management structures for different communication directions are hashed to different hash buckets. Caching could also be realized with the simplified hashing, because for one direction only the management structure of that direction has to be looked up to find a packet for retransmission.

But ignoring the communication direction of a packet leads to serious trouble. It is not sufficient to store each and every TCP packet, because the buffer of the TCP proxy would simply overflow; acknowledged packets have to be released from the buffer again. These ACKs are sent by the receiver of a packet, which means they arrive from the opposite direction. For a design with two management structures per flow, this means an additional lookup of the second management structure to check whether some previously received packet has been acknowledged and can be released from the buffer, because ACKs can only be found in the management structure of the opposite direction. In general, both management structures would have to be looked up for each new packet of a flow. Maintaining only one management structure with a slightly more costly hash function therefore makes sense, because one lookup should always be cheaper than two.

An additional argument for keeping only one management structure is that state transitions of the state machine for a specific flow can and must be triggered by traffic from both directions. The flow itself can have only one state, therefore this implementation maintains only one state in one structure, which corresponds to the TCP standard.


There is no need to store the state twice or to maintain two different states.

The processing of a single packet can be described as follows:

1. Calculate an index value from the TCP flags for the state machine

2. Determine the new state of the flow with the index value and the state machine

3. Check SEQ number of the current packet

4. Check if ACK is present

Check if it is a duplicate ACK

Calculate RTT if possible

Release acknowledged packets from cache

5. Update state and other values in the management structure

6. Install retransmission timer

7. Signal Netfilter to forward packet

8. Return control to capture module

3.4.6.1 Stateful tracking

First, realizing stateful connection tracking requires a representation of the state of a TCP flow. This is implemented as a numeric value which represents the current state of a finite state machine. The state machine used in this implementation is shown in Figure 3.11 on page 47. It is a simplified version to give an easier overview: arrows for state transitions with the same attributes but opposite communication direction are combined into one arrow.

Transitions to other states are triggered by flags in the TCP header (see Figure 2.1 as reference). Flags that can lead to transitions are Synchronize (SYN), Reset (RST), Acknowledge (ACK) and Finish (FIN). Every time a new packet arrives, its flags are processed and an index value of a possible


transition is calculated from them. The index value and the direction of communication are fed into the state machine to determine the new state of a flow; transitions depend on the communication direction and the flags of a packet. To determine the index value for a transition, the relevant flag or flag combination is associated with a numeric value. The numeric value of the first pattern that applies to the flags of the current packet is used as index value (a sketch of this mapping follows the list). The patterns are checked in this order:

1. RST flag present: sent if one of the communicating hosts wants to reset the state of the TCP flow.

2. SYN flag and ACK flag present: sent with the second packet of the 3-way handshake.

3. SYN flag only: sent with the first packet of the 3-way handshake.

4. FIN flag present: sent if one of the communicating hosts wants to close the TCP flow.

5. ACK flag present: normally present if the corresponding host wants to acknowledge a packet.
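A sketch of how this index could be computed is shown below; the flag constants are defined locally for clarity and the enum values are illustrative, not the actual index values used by the imported Netfilter state machine.

#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

enum tcp_index { IDX_RST, IDX_SYNACK, IDX_SYN, IDX_FIN, IDX_ACK, IDX_NONE };

/* Map the TCP flags of a packet to a state machine index; the first
 * matching pattern wins, checked in the order listed above. */
static enum tcp_index flags_to_index(unsigned char flags)
{
    if (flags & FLAG_RST)
        return IDX_RST;
    if ((flags & (FLAG_SYN | FLAG_ACK)) == (FLAG_SYN | FLAG_ACK))
        return IDX_SYNACK;
    if (flags & FLAG_SYN)
        return IDX_SYN;
    if (flags & FLAG_FIN)
        return IDX_FIN;
    if (flags & FLAG_ACK)
        return IDX_ACK;
    return IDX_NONE;
}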

Next, the SEQ number of the current packet is checked. This is done to detect gaps in the sequence numbers and out-of-order packets. To keep track of the next expected SEQ number, the proxy calculates for each packet the sum of the current SEQ number and the size of the packet payload and stores it in the management structure in a variable called "last_end". When the next packet arrives, last_end can be compared with its actual SEQ number. If both values are equal, everything went fine. If the SEQ number is greater than last_end, a packet was lost and a gap in the SEQ numbers is detected. For data traffic coming from the wireless part of the network, a retransmission with TCP SACK is triggered in this case: if the detected flow supports TCP SACK, a new packet with no payload but the corresponding SACK information is created and sent to the corresponding MH. It is not piggybacked on the next packet, in order to minimize the delay.
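The gap check itself boils down to a few lines; the sketch below uses a pointer to the stored last_end value and, for brevity, ignores sequence number wrap-around.

/* Returns 1 if a gap was detected, 0 otherwise, and updates *last_end
 * to the SEQ number expected from the next in-order packet. */
static int check_seq_gap(unsigned int *last_end,
                         unsigned int seq, unsigned int payload_len)
{
    int gap = (seq > *last_end);    /* a packet in between was lost */

    if (seq >= *last_end)           /* ignore duplicates / overlaps here */
        *last_end = seq + payload_len;
    return gap;
}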

If the ACK flag is set for the current packet, the ACK number has to be processed. This is needed to detect duplicate ACKs, to calculate the RTT and to release packets from the buffer once they have reached their destination. Duplicate ACKs can be detected very easily: the proxy only checks whether the current ACK number is equal to the last ACK number, which


is still stored in the management structure. The number of duplicate ACKs is counted and also stored in the management structure; when a successful ACK is detected, this counter is reset to zero. How the RTT is calculated is described in the next section 3.4.6.2. For releasing acknowledged packets from the buffer, the proxy iterates over the buffered packets of the opposite communication direction until the SEQ number of a buffered packet is greater than or equal to the ACK number of the current packet. This can be done very efficiently, because the linked list with the buffered chunks is sorted by the SEQ numbers of the packets; the whole list has to be iterated only if all buffered packets can be released.

With the current RTT value for the current communication direction, a retransmission timeout (RTO) is calculated using the simple formula RTO := t + 4 * RTT, with t being the current time. Finally, all gathered values and other information are stored in the management structure of the current flow; there are sub-structures for direction-dependent information such as ACK or SEQ numbers. A retransmission timer with the calculated RTO is installed, and via a function of the Netfilter Queue Interface (see page 34) Netfilter is signaled to forward the packet. After that the Ethernet bridge forwards the packet and it can reach its destination.

3.4.6.2 Round Trip Time calculation

The RTT is the time a packet needs to reach its destination plus the time needed for its acknowledgment. The time for an acknowledgment is also a packet transit time, because the acknowledgment is sent as a separate packet or along with another packet.

For calculating the RTT passively on the TCP proxy, every captured packet needs a timestamp. With a timestamp on each and every TCP packet passing through, both a packet and its acknowledgment get a timestamp. If the Connection Manager module detects an acknowledgment, it tries to release the acknowledged packet from the buffer. Shortly before releasing the packet, it looks up the timestamp of the original packet and the timestamp of the acknowledgment packet. The difference of the two timestamps is the RTT of a packet from the TCP proxy to a host.
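The difference of the two capture timestamps can be computed as sketched below (returning microseconds; the helper name is illustrative).

#include <sys/time.h>

/* RTT in microseconds, from the capture timestamp of the data packet
 * and the capture timestamp of its acknowledgment. */
static long rtt_usec(const struct timeval *sent, const struct timeval *acked)
{
    return (acked->tv_sec  - sent->tv_sec) * 1000000L
         + (acked->tv_usec - sent->tv_usec);
}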

RTT values are stored separately for each communication direction of a flow, because the bandwidth of the wireless link is normally lower than that of the wired link, which creates an asymmetry. For the faster link the RTT values are probably lower, therefore it makes sense to store two RTT values per flow.


From the perspective of one of the communicating hosts, the RTT would be the sum of the two separate values stored by the TCP proxy. The TCP proxy implementation uses a simple algorithm to determine the RTT. If a retransmission is in progress, the RTT calculation is paused to avoid difficulties: during a retransmission, for example, a delayed acknowledgment could be received and produce a very low RTT value just because the retransmitted packet has only just been sent. Every new RTT value is smoothed by 25% to be more resistant against fluctuations and measurement errors. This is done by the formula

RTT = (3 * RTT_old + RTT_new) / 4