
Georg-August-Universität Göttingen, Zentrum für Informatik
ISSN 1612-6793
Number GAUG-ZFI-BSC-2008-05

Bachelor's Thesis
in the degree program "Angewandte Informatik" (Applied Computer Science)

TCP Performance Enhancement in Wireless Environments:
Prototyping in Linux

Swen Weiland

Computer Networks Group (Arbeitsgruppe für Computernetzwerke)

Bachelor's and Master's theses of the Zentrum für Informatik at Georg-August-Universität Göttingen

13 May 2008

Georg-August-Universität Göttingen, Zentrum für Informatik
Lotzestraße 16-18, 37083 Göttingen, Germany

Tel.:   +49 (551) 39-14414
Fax:    +49 (551) 39-14415
Email:  office@informatik.uni-goettingen.de
WWW:    www.informatik.uni-goettingen.de

I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Göttingen, 13 May 2008

Bachelor's Thesis

TCP Performance Enhancement in Wireless Environments: Prototyping in Linux

Swen Weiland

13 May 2008

Supervised by Prof. Dr. Xiaoming Fu, Computer Networks Group (Arbeitsgruppe für Computernetzwerke), Georg-August-Universität Göttingen

Contents

1 Introduction ........ 6
  1.1 Motivation to Optimize Wireless Networks ........ 7
  1.2 Contribution of This Thesis ........ 7
  1.3 Thesis Organization ........ 8
2 Background and Related Work ........ 9
  2.1 TCP Basis ........ 9
  2.2 Existing Works on TCP Improvements in Wireless Networks ........ 10
  2.3 TCP Snoop ........ 12
    2.3.1 Overview ........ 12
    2.3.2 Basic Idea and how Enhancements are achieved ........ 13
3 Proxy Implementation Design ........ 15
  3.1 Overview and a Brief Function Description ........ 15
  3.2 Interface to the Kernel for capturing ........ 19
  3.3 Capturing Packets - a Comparison of Methods for Capturing ........ 20
    3.3.1 libPCap ........ 20
    3.3.2 RAW Sockets ........ 22
    3.3.3 Netfilter Queue ........ 24
    3.3.4 Kernel Ethernet Bridge ........ 24
      3.3.4.1 Netfilter ........ 25
    3.3.5 Conclusion of the Comparison ........ 26
  3.4 Module Design of the Proxy Implementation ........ 28
    3.4.1 Operating System requirements ........ 28
    3.4.2 Module: Buffer Manager ........ 28
    3.4.3 Module: Netfilter Queue Interface ........ 31
    3.4.4 Module: TCP Connection Tracker ........ 34
      3.4.4.1 Prime Numbers for the Secondary Hash Function ........ 35
      3.4.4.2 Identify a TCP flow ........ 36
      3.4.4.3 Retransmission Buffer ........ 39
    3.4.5 Module: Timer Manager ........ 40
    3.4.6 Module: Connection Manager ........ 42
      3.4.6.1 Stateful tracking ........ 43
      3.4.6.2 Round Trip Time calculation ........ 45
4 Evaluation ........ 48
  4.1 Testing ........ 48
    4.1.1 TCP Connection Tracker ........ 49
    4.1.2 Connection Manager with implemented TCP Snoop behavior ........ 50
  4.2 Performance Evaluation ........ 52
5 Conclusions ........ 55
  5.1 Summarization of Results ........ 55
  5.2 Future Work and Outlook ........ 57
Bibliography ........ 58

List of Figures

2.1 TCP Header ........ 10
3.1 Overview: Implemented modules and their communication ........ 16
3.2 Dataflow with libPCap ........ 21
3.3 Dataflow with RAW Sockets ........ 23
3.4 Ethernet Type II Frame format ........ 24
3.5 Dataflow with Netfilter Queue ........ 26
3.6 Logical structure: Linked list used as a FIFO ........ 29
3.7 Implementation: Linked list at initial state ........ 30
3.8 Implementation: Linked list after retrieval of a chunk ........ 30
3.9 Implementation: Linked list after returning the chunk ........ 31
3.10 TCP Handshake ........ 38
3.11 Simplified TCP State Machine Diagram from the Connection Tracker ........ 47
4.1 Testbed for initial TCP connection tracking test ........ 49
4.2 Testbed for TCP connection tracking test ........ 50
4.3 Testbed for TCP Snoop ........ 51

Abstract

In recent years, wireless communication has become more and more popular. Future wireless standards will reach throughputs much higher than 100 Mbit/sec on the link layer. However, wireless channels, compared to wired lines, exhibit different characteristics due to fading, interference, and so on. For the Transmission Control Protocol (TCP), the misinterpretation of packet loss caused by wireless channel characteristics as network congestion results in suboptimal performance. There are many different approaches to enhance TCP over wireless networks, especially for slow and lossy links such as satellite connections. This thesis evaluates "TCP Snoop" as one of these approaches for high transfer rates. Finding, using and implementing effective capturing, buffering and tracking of TCP communication were the objectives to solve. A general and transparent TCP proxy with "TCP Snoop" behavior was implemented during the work for this thesis. The TCP proxy runs on an intermediate Linux host which connects wired and wireless networks, as a prototype user space application with a modular design.

Different traffic capture methods are compared in terms of portability and performance. A full TCP connection tracking is described and implemented. Design patterns and methods that proved their benefit in practice were applied and sometimes partially modified to fit the needs of the transparent TCP proxy. The modular design makes it possible to exchange a low-level module such as the data traffic capture module. Porting the implementation to another operating system or another platform, like embedded systems which are used as wireless LAN routers, or changing the TCP enhancement method are also eased by the modular design.

The results show that a transparent TCP proxy or any other traffic-modifying implementation should not reside in user space for performance reasons. A kernel space implementation, or even better dedicated hardware like a network processor platform, should be used for such implementations.

1 Introduction

Nowadays, the Internet is more and more based on wireless technologies, especially for the last few meters or even the last few miles. For example, many of the DSL Internet connections in Germany are bundled with subsidized wireless access points in their product packages. From my personal experience, people like to sit on the sofa or in the garden outside the house while they are still using the Internet for surfing, chatting and downloading. In such scenarios, wireless is becoming more and more desirable.

According to the traffic analysis study by Simon Leinen at Columbia, the majority of Internet traffic, about 74% [Leinen], is TCP. An extract of the results from his traffic analysis is presented at the bottom of this page as Table 1.1. As TCP was originally designed for wired communication, there are some drawbacks in wireless scenarios. If these issues could be solved, or at least optimized, this would also optimize the majority of Internet traffic. This thesis focuses on the TCP performance enhancement issues over wireless environments. More specifically, I performed a prototype implementation of a transparent TCP proxy as a user space application for optimizing end-to-end TCP performance. The implementation is meant to run on, or very near to, the last hop to a mobile node. A user space application benefits from a straight design and is independent of any restrictions that apply to implementing a kernel module. The restricted privileges of a user space application protect the operating system and lead to good system stability. Moreover, debugging is easier in user space, which simplifies rapid prototyping.

Protocol   Flows        Flows (%)   Packets        Packets (%)   Bytes            Bytes (%)
GRE        383          0.00 %      17235          0.00 %        3602115          0.00 %
ICMP       101931237    1.75 %      305793711      0.45 %        37918420164      0.11 %
IGMP       34662        0.00 %      901212         0.00 %        58578780         0.00 %
IP         1406788      0.02 %      15474668       0.02 %        3528224304       0.01 %
IPINIP     1297         0.00 %      1297           0.00 %        583650           0.00 %
TCP        4361852662   74.91 %     63919315234    93.39 %       32455859980970   96.82 %
UDP        1357265629   23.31 %     4201556174     6.14 %        1025284993546    3.06 %

Table 1.1: Analysis taken by Simon Leinen on an access router at a random university [Leinen].


1.1 Motivation to Optimize Wireless Networks

In TCP, packet losses are interpreted by the TCP stack as congestion by default. For wired network hardware today, packet loss is not a real problem any more because of the very low bit error rate. However, wireless and mobile networks are often characterized by sporadically high bit-error rates, intermittent connectivity and interference effects [Caceres]. This results in higher bit error rates than in wired networks. Additionally, a sender lowers its packet sending rate due to the misinterpretation of packet losses as congestion.

To avoid this, some researchers propose various TCP enhancements [Bakre], [Balakrishnan1995], [Chen] which try to reduce or eliminate these impacts. To apply such an enhancement to a network, in most cases the infrastructure has to be changed or the nodes have to be reconfigured. If a transparent TCP proxy is used, nothing in the infrastructure has to be changed; only the proxy function is added.

The positioning of such a proxy should be as near as possible to the wireless part, in order to react more quickly to changes in the wireless part of the network. Directly in a base station of a wireless network is the nearest and therefore best position. I assume it has sufficient RAM and processing power to perform the necessary proxy functions, which is plausible given today's manufacturing technologies, and I will come back to this issue in the evaluation in the later parts of this thesis.

1.2 Contribution of This Thesis

The contribution of this thesis can be summarized as follows:

Representative approaches for TCP enhancements over wireless environments are identified and classified into three groups.

A transparent proxy approach, TCP Snoop [Balakrishnan1995], [Balakrishnan1996], [Balakrishnan1999], is selected for primary study due to its nice tradeoff between functionality and complexity. A software design over Linux is presented and implemented.

A key technique used in transparent proxy approaches, namely data capturing, is identified. A data capturing solution is chosen out of several alternative solutions, according to their ability and performance to capture and modify the through-passing traffic.

Finally, a performance analysis of the implementation is given and the TCP Snoop approach as a user space application is evaluated systematically.


1.3 Thesis Organization

The thesis starts with a general survey of related work, including an introduction to "TCP Snoop" as the background for the software design and implementation. I then discuss the design of the software framework:

First, requirements and dependencies on other software are introduced, followed by a bottom-up overview of the general design of the software with a short functional description. After deciding on the most efficient capture method for this implementation, the thesis describes how each specific module was designed and implemented, which design patterns were applied, and why the implemented algorithms were chosen. Note that source code is not discussed directly; only necessary function descriptions or data structures are given in order to provide a closer view of the software. The implementation is then evaluated with some test cases and analyzed in terms of performance.

For the convenience of the readers, below I give a short summary of the notation I use in this thesis:

Bibliographic sources are given as references in brackets.

Shell commands and source code are shown in framed boxes.

Bold words represent references to labels in figures and/or names of software modules. Italic words represent important expressions or the project name of a software implementation that was used or compared against.

References to the Internet are inserted directly into the text as footnotes.


2 Background and Related Work

2.1 TCP Basis

Basic knowledge about the Internet Protocol (IP) and the Transmission Control Protocol (TCP) is assumed. In this thesis the focus is on TCP, and in this paragraph some facts about TCP are recapitulated which are referred to later or explained later in more detail. If you have good knowledge about TCP, the rest of this section can be skipped and you can continue reading at Section 2.2.

TCP is a reliable stream delivery service that guarantees to deliver a stream of data sent from one node to another without duplication or loss of data. It also provides flow control and congestion control. A flow or connection is identified by a source and destination IP address and especially by the source and destination TCP port number. Reliability is achieved with acknowledgment messages, which are sent along with data packets or standalone with an empty data packet. These acknowledgment messages are represented as special fields in the TCP protocol header. To be precise, the acknowledgment number (ACK number) and the acknowledgment flag (ACK flag) are meant.

The structure of the TCP protocol header is shown in the following Figure 2.1 on page 10. The position of the ACK number and the ACK flag can be looked up in the figure. By comparing the sequence number of already sent packets with the acknowledgment number of a currently received packet, the TCP stack can decide which packets have reached their destination, are still on the way, or are lost. All sent-out packets in the outgoing buffer with a sequence number lower than the acknowledgment number of a received packet have reached their destination. Every time a packet or a bunch of packets is sent out, a retransmission timer is started.

A loss event is a timeout of this timer or 3 duplicate ACKs.

"The fast retransmit algorithm uses the arrival of 3 duplicate ACKs (4 identical ACKs without the arrival of any other intervening packets) as an indication that a segment has been lost." [RFC2581, paragraph 3.2 on page 6]


Figure 2.1: TCP Header (fields: source port, destination port, sequence number, acknowledgment number, data offset, reserved, flags URG/ACK/PSH/RST/SYN/FIN, window, checksum, urgent pointer, options of 0 or more 32-bit words, data)

Loss events are the most important information for a TCP proxy, because reacting to and resolving these loss events is one of its main tasks. Later, in 3.4.4, more parts of the TCP header are addressed, and Figure 2.1 can be used as a reference.
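To make the two loss events concrete, the following is a minimal C sketch of how a connection tracker might count duplicate ACKs; the structure and function names are illustrative assumptions, not code from the thesis.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-flow state (names are hypothetical). */
struct flow_state {
    uint32_t last_ack;      /* highest ACK number seen so far          */
    unsigned dup_acks;      /* identical ACKs received after last_ack  */
};

/* Returns true if this ACK completes the "3 duplicate ACKs" loss event
 * quoted above from RFC 2581; sequence number wraparound is ignored
 * here for brevity. */
static bool ack_signals_loss(struct flow_state *fs, uint32_t ack,
                             uint16_t payload_len)
{
    if (payload_len == 0 && ack == fs->last_ack) {
        if (++fs->dup_acks >= 3)
            return true;              /* loss event: e.g. retransmit locally */
    } else if (ack > fs->last_ack) {  /* new data acknowledged */
        fs->last_ack = ack;
        fs->dup_acks = 0;
    }
    return false;
}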

2.2 Existing Works on TCP Improvements in Wireless Networks

This section gives a brief overview of related work on mechanisms for improving TCP performance over wireless links. According to the work of Balakrishnan et al. [Balakrishnan1996] and Xiang Chen et al. [Chen], these mechanisms can be grouped into:

End-To-End: End-To-End schemes apply enhancements directly in the TCP stack or extend it with TCP options. This implies a modification of the TCP stack, which is mandatory for both communication partners if they want to benefit from this type of enhancement. Examples are TCP-NewReno or Explicit Loss Notification (ELN) [Balakrishnan1998]. The behavior of the TCP stack for loss events is optimized, or additional information is sent out and processed to realize a better differentiation between congestion and packet loss caused by link layer errors.


Link-Layer: Link-Layer approaches try to make the link layer aware of higher layer protocols like TCP. No TCP stack modification on the communicating nodes is required, but an intermediate node is added to the network infrastructure. An example of this group is TCP Snoop. Link layer errors are corrected by enhanced retransmission techniques, and because of the transport layer awareness these errors can be hidden. Misinterpretations by the transport layer that congestion occurred instead of link layer errors are suppressed.

Split-Connection: As the name suggests, a flow of a connection-oriented and reliable transport layer protocol, e.g. TCP, is split at an intermediate node into two separate flows. Therefore the intermediate node needs to maintain two separate TCP stacks, but data copying between these stacks is avoided by passing only pointers and having a shared buffer. Examples for split connection schemes are I-TCP and SPLIT.

All mechanisms are very similar, because they fight against the same issues with slightly different methods and effort. Link layer transmission errors are detected, and misinterpretations that congestion occurred are avoided. Asymmetric network links are handled more efficiently, and additionally some of these mechanisms do caching and local retransmission to recover more efficiently from packet losses.

Modifying the TCP stack as the End-To-End approaches do is very effective, because wrong or ineffective behaviors are suppressed directly at their source. The additional processing caused by this type of modification is often very small. On the other hand, these approaches are hard to deploy, because in most cases both end-nodes need to implement the modifications to benefit from them, and in some cases middle-nodes are involved as well. Link layer feedback is limited and issues are mainly fixed on the transport layer. For Link-Layer or Split-Connection approaches only an intermediate node or module is added to the network infrastructure, and no end-node has to be modified. This means they are easier to deploy, but adding an additional node also raises the processing power needed for communication. An intermediate node can only guess the state of a TCP stack and can only influence it indirectly. Split-Connection approaches try to solve the indirect influence issue by maintaining two separate TCP stacks for each flow on the intermediate node, but this also nearly doubles the processing overhead. Link-Layer approaches are a good tradeoff, because they are easier to deploy and less complex than Split-Connection approaches. Furthermore, Link-Layer approaches do not break the end-to-end communication like Split-Connection approaches do, which makes roaming in wireless networks possible without any trouble.


2.3 TCP Snoop

2.3.1 Overview

As defined by Balakrishnan et al. [Balakrishnan1995], TCP Snoop seeks to improve the performance of the TCP protocol for wireless environments without changing the existing TCP implementations, neither in the wired network nor in the wireless network. It is designed to be used on or near a base station of a wireless network to enhance TCP end-to-end connections. Every through-passing TCP packet is buffered in a local memory for doing a fast and local retransmission if a packet loss occurs on the wireless part of the network. TCP Snoop behaves as a transparent proxy and maintains a cache of TCP packets for each TCP flow. Lost packets are detected and locally retransmitted. Therefore each flow is tracked with an additional but simplified TCP stack in the proxy implementation. TCP Snoop does not break end-to-end connectivity like Split-Connection approaches and stays completely transparent. This makes roaming from one wireless part to another wireless part of the same network possible.

There are several reasons for treating wireless networks differently for TCP enhancement. Wired network links are very reliable and generally offer higher bandwidth than wireless links. Reasons for this are the physical medium access method Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) and the medium itself. The wireless medium is more vulnerable to physical interference and it is only a half duplex medium. In contrast, wired communication is mainly used as a full duplex medium and can easily be shielded as protection against physical interference.

Triggered by lost packets or duplicate ACKs, which are also used for signaling lost packets, a TCP stack may suspect a congested link and lowers its sending rate. This leads to a suboptimal usage of bandwidth in wireless networks, because congestion in standard TCP is only detected by losses, but in wireless networks there are many reasons for losses. Wireless links can recover to a higher bandwidth very quickly if an interference stops or weakens, but the TCP stack detects this very slowly compared with detecting congestion. There is simply no signaling for these bandwidth changes in the standard TCP stack. To avoid this, duplicate ACKs are suppressed by the TCP Snoop proxy and a local retransmission is triggered for every lost packet. For wired links the congestion assumption is normally the right choice, because of the lack of temporarily high bit error rates in the medium. With high probability, wired links have reached their current maximum bandwidth if packet losses occur, which means congestion.


2.3.2 Basic Idea and how Enhancements are achieved

TCP Snoop is implemented as a transparent TCP proxy near to or on a base station of a wireless network. A TCP proxy has two Ethernet devices and forwards all traffic from one device to the other. On the way from one device to the other, a packet can be modified by the proxy if necessary, and this modification is also transparent for the network. Non-TCP traffic is ignored and just forwarded, but TCP traffic is processed by the proxy before it is forwarded as well. Processing means identifying each TCP flow and tracking it. If a packet loss is detected during the tracking, a local retransmission is done. Losses are detected by a certain amount of duplicate ACKs and by timeouts of a locally installed retransmission timer, which is part of a simplified TCP stack in the proxy. The simplified TCP stack is utilized for tracking TCP and tracks SEQ numbers, ACK numbers and other dynamic values of a previously identified TCP flow.

The proxy should be placed as near as possible to the wireless base station in the network topology to reduce response times. All unacknowledged packets from the fixed host (FH) to the mobile host (MH) are cached in the buffer of the proxy. This should be a buffer in a fast memory like DRAM or SRAM. Unnecessary invocations of congestion control mechanisms for a TCP flow are avoided by hiding duplicate ACKs and doing the local retransmission.

The authors of the original paper that defined TCP Snoop updated it [Balakrishnan1996], [Balakrishnan1999] to improve the performance for packet losses on the way from the MH to the wireless base station. The update is implemented by sending selective retransmission requests to the MH and can be described as follows. Packets from the MH to the FH are processed and cached as normal, but if a gap in the ascending sequence numbers is detected, a Negative Acknowledgment (NACK 1) is sent from the proxy to the MH, which triggers a retransmission. A new copy of the lost packet, which the TCP proxy can cache, should then already be on the way by the time the FH realizes that it was lost.

Native TCP supports only positive acknowledgment of packets. The Selective Acknowledgment (SACK) is a TCP option which allows sending selective Acknowledgments (ACKs) or selective Negative Acknowledgments (NACKs) for specific packets. A TCP option defines an extension to TCP, which is sent in an optional part of the TCP header.

To verify that using this TCP SACK option and the proposed updates are applicable, I investigated the distribution of SACK support. From December 1998 to February 2000, the fraction of hosts sampled with SACK-capable TCP increased from 8% to 40% [Allman]. Today it should be 90% and above, because SACK is supported by every major operating system.

1 Part of the Selective Acknowledgment (SACK) TCP option. Standardized by RFC 2018.


To give some names, SACK is supported by Windows (since Windows 98), Linux (since 2.2), Solaris, IRIX, OpenBSD and AIX [Floyd].

The host support of the SACK extension is detected by the proxy during the three-way handshake which establishes a TCP flow. If SACK is not supported by one of the two hosts, it cannot be used for this flow at all. This is specified by RFC 2018, which defines the TCP SACK option. In this case the proxy skips the enhancement for the traffic from the MH to the FH and just tries to enhance the traffic from the FH to the MH.


3 Proxy Implementation Design

3.1 Overview and a Brief Function Description

The TCP proxy prototype implements a transparent TCP enhancer with a modular design for easier extension and the possibility to implement other TCP enhancements in future work. For enhancing the TCP traffic, the proxy must have the ability to drop or modify through-passing TCP packets. To achieve this, the TCP proxy breaks the physical medium and puts itself in between (see Figure 4.3) as an intermediate node. This gives the TCP proxy total control over which packet is forwarded, modified or dropped, because it has to forward each packet from one interface to the other.

The following Figure 3.1 on page 16 gives an overview of the general software design of the proxy and its core modules. Details of each module and all applied design patterns or used algorithms are described later in this chapter in 3.4.

For the implementation the C programming language was used. Multi-threading, thread synchronization and mutual exclusions (mutexes 1) were implemented using POSIX Threads. POSIX Threads is a POSIX standard for threads and defines a standard API for creating and manipulating threads. Mutexes are needed to secure and order asynchronous write operations. The POSIX thread implementation is platform independent, well standardized, well documented, and usable as a simple library or integrated into the operating system.
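The thesis does not reproduce this setup code; as a minimal sketch only, a capture thread and a mutex protecting shared flow state could be wired up with POSIX Threads roughly as follows (all names are illustrative assumptions).

#include <pthread.h>
#include <stddef.h>

/* Mutex that orders asynchronous writes to the shared flow state. */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical capture thread body: blocks on the capture interface and
 * updates shared per-flow data under the mutex. */
static void *capture_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* ... wait for the next packet (blocking call) ... */
        pthread_mutex_lock(&flow_lock);
        /* ... update the shared TCP flow state ... */
        pthread_mutex_unlock(&flow_lock);
    }
    return NULL;
}

/* Started once by the main thread during initialization. */
int start_capture_thread(void)
{
    pthread_t tid;
    return pthread_create(&tid, NULL, capture_thread, NULL);
}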

The main target platform is Linux with kernel version 2.6.14. This is dictated by the need to use libnetfilter_queue 2 as the interface to the kernel, which is used for capturing, filtering and modifying TCP packets. This extension of the Netfilter packet filter requires this version of the Linux kernel. Netfilter is the standard packet filter of the Linux operating system. All functionality needed to use this interface is implemented in the Netfilter Queue Interface (see Figure 3.1) module of the proxy.

To support older kernel versions, which are often used on embedded systems like Wireless LAN (WLAN) routers, only this Netfilter Queue Interface module has to be replaced.

1 http://en.wikipedia.org/wiki/Mutual_exclusion
2 Netfilter Queue - see 3.3.3 for further description


Figure 3.1: Overview: Implemented modules and their communication (Ethernet Devices #0 and #1, Kernel Ethernet Bridge device, Netfilter Queue Interface, TCP Connection Tracker, Connection Manager with TCP Snoop, Buffer Manager, Timer Manager and Packet Generator/Manipulator; retransmitted and newly created TCP packets are sent out via a RAW Socket, free buffer chunks are provided by the Buffer Manager, and an acknowledgment for forwarding is returned to the Netfilter Queue Interface)

The replacement would be some other module which does the capturing and filtering of the TCP traffic. Such a module could use a RAW Socket 3 for capturing and a generic protocol classifier to detect the TCP traffic. A back-port to earlier versions of the Netfilter framework would also be possible with some, but minimal, effort. Back-porting to some embedded router could be done in future work if needed. The modular design makes this possible and easier to achieve.

The Netfilter interface was chosen over the other capturing methods for performance reasons, which is described and analyzed later in 3.3. A proxy instance consists of three threads. The main thread, which is created by the operating system, is the first thread. After some basic initialization the main thread installs an operating system callback function which triggers the Timer Manager module. This callback function is called at a fixed interval.

3 RAW-Socket - see 3.3.2 for further description


It is counted as a separate thread because actions can be triggered or data can be manipulated asynchronously to the main thread. The third thread is also created by the main thread and is used for capturing. The capturing thread resides in the Netfilter Queue Interface module.

Retransmission is implemented by utilizing RAW Sockets. These allow creating and sending custom TCP packets. Every field in the IP and TCP header can be set to a custom value. In the case of the TCP proxy, mainly a previously buffered packet is just retransmitted.

To do the retransmission, the proxy must be aware of each TCP flow and its state. Only with this information does the proxy know the point in time and which TCP packet has to be retransmitted. After some TCP traffic has been captured by the Netfilter Queue Interface module, it is handed over to the TCP Connection Tracker module. The TCP Connection Tracker module gathers the following information:

Source and destination IP address

Source and destination TCP port

Packet is stored in a queue (one for each direction per connection)

Pointer to the corresponding management structure

After the TCP flow is identified by the TCP Connection Tracker and the corresponding management structure is known or created, all this information is passed to the Connection Manager module, which adds the following information to the management structure (a sketch of such a combined structure follows the list):

Connection Status

State (TCP State-Machine)

Acknowledge Number (ACK)

Sequence Number (SEQ)

Present TCP options

Round Trip Time (RTT)

Timer (for retransmission)
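As a plausible illustration only (the thesis does not show the corresponding source code, and all type and field names below are assumptions), the two lists above could be combined into one management structure per flow like this:

#include <stdint.h>

struct chunk;   /* buffered packet, see the Buffer Manager in 3.4.2 */
struct timer;   /* handle managed by the Timer Manager, see 3.4.5   */

/* Hypothetical per-flow management structure. */
struct tcp_flow {
    /* gathered by the TCP Connection Tracker */
    uint32_t src_ip, dst_ip;       /* source and destination IP address  */
    uint16_t src_port, dst_port;   /* source and destination TCP port    */
    struct chunk *queue[2];        /* packet queue, one per direction    */

    /* added by the Connection Manager */
    int      status;               /* connection status                  */
    int      state;                /* TCP state machine state            */
    uint32_t ack_num;              /* acknowledgment number (ACK)        */
    uint32_t seq_num;              /* sequence number (SEQ)              */
    uint32_t options;              /* present TCP options (e.g. SACK)    */
    uint32_t rtt;                  /* round trip time estimate           */
    struct timer *retrans_timer;   /* retransmission timer               */
};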


The Connection Manager is the main module; it makes all important decisions and can be seen as a concentrator for all the information. TCP packets can be forwarded, dropped, modified or retransmitted. New packets can also be created. As shown in Figure 3.1, the TCP Snoop behavior is implemented in this module.

There are also three helper modules: Buffer Manager, Timer Manager and Packet Generator/Manipulator.

The Buffer Manager offers some management functions for the buffer memory of the proxy. A big memory block is reserved at the initialization of this module and divided into small chunks. Some other module, mainly the Netfilter Queue Interface for capturing and storing new data, can retrieve an unused chunk from the Buffer Manager. If a chunk is not needed any more, it is returned to the Buffer Manager. Different chunk managements are implemented. They are used during capturing, as described before, and for the queue management per TCP flow. The TCP packets for each tracked TCP flow are buffered in a special queue. There is one queue per communication direction of the flow.

The Timer Manager is used by the Connection Manager to install the retransmission timers and to install special timeout timers for some TCP states. It offers other modules the possibility to trigger a specific action after a specified time period. This Timer Manager module is necessary because an application thread can only install one timer callback with only one fixed interval. One callback would not be enough if the Connection Manager wants to install at least one retransmission timer for each TCP flow. Therefore the Timer Manager handles and manages this one callback for every module that wants to install one timer or as many timers as it wants. Repeating and one-time timers are possible. The callback is realized and handled with the signal handling of the operating system (a minimal sketch follows after this list). Another way to implement it would be busy waiting in a separate thread, but this is definitely a bad choice; busy waiting is in most cases a bad choice for software.

The Packet Generator/Manipulator is only used by the Connection Manager to create new packets or modify packets, for example for hiding duplicate ACKs or sending a NACK for a specific packet to the original sender.
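As a minimal sketch of the single operating system callback that the Timer Manager multiplexes, the following uses setitimer() and SIGALRM; the thesis only states that OS signal handling at a fixed interval is used, so this exact mechanism and all names are assumptions.

#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* The one callback installed by the main thread; the Timer Manager would
 * walk its own list of installed timers here and fire those that expired. */
static void timer_tick(int signo)
{
    (void)signo;
    /* ... check retransmission and TCP state timeout timers ... */
}

/* Install a periodic SIGALRM with the given interval in milliseconds. */
int timer_manager_init(long interval_ms)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timer_tick;
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    it.it_interval.tv_sec  = interval_ms / 1000;
    it.it_interval.tv_usec = (interval_ms % 1000) * 1000;
    it.it_value = it.it_interval;          /* first tick after one interval */
    return setitimer(ITIMER_REAL, &it, NULL);
}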


3.2 Interface to the Kernel for capturing

The Linux operating system segregates its memory into kernel space and user space. Kernel space is privileged and strictly reserved for the kernel and device drivers. A normal application runs in user space and has no direct access to kernel space memory areas.

If a TCP packet arrives, it is pushed as an Ethernet frame from the device driver of the Ethernet card to the Ethernet stack in kernel space. From there it is passed to the TCP/IP stack in the kernel if it was identified as IP traffic; we assume this as the example. To send this packet to an application running in user space, it has to be copied to a memory area in user space. This has to be done because kernel space memory is not directly accessible from user space. When passing data from the kernel to a user space application, nothing more happens than duplicating a memory area into user space. Even if the packet is not needed any more after duplication, the data still has to be copied to an address space assigned to the user space. This effort has to be made for security reasons; just remapping the kernel space memory block into user space is not possible. Copying memory and throwing one of the copies away is an expensive operation. How to reduce the number of copy operations is shown in the next section, 3.3. Via the TCP socket, a kind of interface library, the application passes a pointer to a memory area in user space for the incoming packet to the kernel. The kernel duplicates the buffer with the incoming packet into the user space memory addressed by the pointer.

Normally the Ethernet card passes only Ethernet frames which are addressed to its Media Access Control address (MAC address) to the device driver. But the card can be set into a special mode. This mode is called "promiscuous mode" and makes the Ethernet card pass all Ethernet frames on the wire to the device driver. In this mode the kernel has to process all the traffic on the wire: the traffic which is destined to its own IP address and the traffic to any other host in the Ethernet subnet. The additional traffic to other hosts caused by the "promiscuous mode" is dropped by the Ethernet stack or later in the IP stack of the kernel. Only traffic to the own host which is addressed to a corresponding application on the host is passed to user space.

To get a copy of all traffic into user space, a special kernel interface is needed, especially for the traffic which is destined to other hosts. With such an interface it is possible to grab a copy of the traffic before it reaches the Ethernet stack or before it reaches the IP stack. This means Ethernet frames or IP packets can be grabbed and pulled into user space. Such an interface can be a RAW Socket (described in 3.3.2) or a special kernel module like Netfilter (described in 3.3.3). Using such an interface is usually known as "sniffing" or "capturing".


As we now know, passing traffic to user space is expensive, but passing all traffic from the wire to user space is very expensive! The "promiscuous mode" also leads to much more work for the kernel, because the traffic which is not destined to the own host is processed as well. On slow machines this can easily lead to performance problems.

3.3 Capturing Packets - a Comparison of Methods for Capturing

In this section different methods for capturing are shown and compared with a focus on performance and usability for the TCP proxy. The selected methods were chosen because of their popularity and because they have been present and/or usable for at least 5 years. This should ensure that a prototype based on such an interface or library will be usable with newer versions of operating systems and/or these interfaces. Other proprietary kernel modules, which only work for a few Linux kernel versions, were ignored.

3.3.1 libPCap

The Packet Capture library (libPCap 4) provides a multi-platform and high-level interface for packet capturing on the link layer. It puts the Ethernet device into "promiscuous mode" and supports many common operating systems like different Linux, Windows, MacOS and BSD versions. The interface to the library is well documented, and packet capturing can be implemented within a few lines of source code.

Early versions of my TCP proxy used libPCap for capturing because of the easy handling and the support for many platforms. The library itself uses RAW Sockets (3.3.2) for capturing Ethernet frames and only abstracts the usage of RAW Sockets on the different platforms. Additionally it adds a buffer management and a filter management. The application does not have to care about the handling of an incoming buffer, and buffer overflows are also handled by the library. Just the buffer size and a callback function for handling incoming frames need to be defined during the initialization.

4 http://www.tcpdump.org/


All the traffic from the wire is stored in the buffer and then filtered. If no predefined filter applies, the library calls the previously defined callback function. In this function the Ethernet frame has to be processed by the application. After the callback function returns control to the library, the memory with the Ethernet frame is freed and reused for capturing. Freeing the memory after the callback function is the main disadvantage for implementing the proxy. If the proxy wants to retransmit a packet, the frame with the packet has to be copied during the callback to another buffer (see (5) in Figure 3.2). If not, the packet would be lost for the proxy after the callback and could not be retransmitted if this were needed.

Figure 3.2: Dataflow with libPCap ((1)-(3): all traffic is copied from the kernel into the libPCap buffer in user space and handed to the proxy's callback function; (4): all traffic is copied back to kernel space via a RAW Socket to the other Ethernet device; (5): only TCP packets are copied into the retransmission buffer of the proxy)

The second disadvantage is that the filtering of libPCap (which would happen before (3)) is done in user space. Using the internal filtering of libPCap would reduce the amount of data that needs to be processed by the proxy and would therefore be an optimization. On the other hand, the proxy needs to bridge the two Ethernet devices to be transparent, and to achieve this, all Ethernet frames from one device have to be sent out via RAW Sockets on the other device. The filter management of libPCap is therefore useless, because all frames are needed for bridging the devices. In order not to break an ongoing communication, none of them may be filtered out on the way from one Ethernet device to the other.

To buffer the TCP traffic, the proxy has to filter the bridged traffic for TCP packets, so filter functionality has to be implemented in the proxy. As shown in Figure 3.2, all traffic takes the way (1), (2), (3) from kernel to user space into the libPCap buffer and to the callback of the proxy. Further, the traffic goes the way back via (4) into kernel space via a RAW Socket to the other Ethernet device. Only the TCP packets are duplicated into the retransmission buffer of the proxy, which is symbolized by arrow (5).
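To illustrate the capture path just described, here is a minimal, self-contained libPCap capture loop. It is a sketch only; the device name, snapshot length and timeout are assumptions, and it merely prints frame sizes instead of implementing the proxy callback.

#include <pcap.h>
#include <stdio.h>

/* libPCap callback: the frame memory is only valid until this function
 * returns, so a proxy would have to copy any packet it may want to
 * retransmit later (arrow (5) in Figure 3.2). */
static void handle_frame(u_char *user, const struct pcap_pkthdr *hdr,
                         const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured frame, %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* open the device in promiscuous mode with a 100 ms read timeout */
    pcap_t *p = pcap_open_live("eth0", 65535, 1, 100, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }
    pcap_loop(p, -1, handle_frame, NULL);   /* capture until interrupted */
    pcap_close(p);
    return 0;
}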

3.3.2 RAW Sockets

A RAW socket is a socket that can be seen as a direct interface to the transport or network layer. It passes a copy of a frame or packet directly to user space, before it is processed by the Ethernet stack or IP stack. It is also possible to send data like a TCP packet directly to the wire without it being processed, or with it being only partially processed, by the IP stack. Partially means, for example, calculating the checksums in the headers. This is a very good possibility to implement retransmission for the proxy, but let's focus back on capturing.

The current RAW socket interface has been supported by the Linux kernel since version 2.2.x. In version 2.0.x there is a very similar interface, but it is obsolete and deprecated now. Using RAW sockets for capturing was tested with kernel version 2.4.x on a Linksys WRT54G 5, an embedded Linux router with a wireless interface, and on a PC with Linux kernel version 2.6.x. Normally a RAW Socket gets only the traffic that is destined to the MAC addresses of the Ethernet devices owned by the proxy host. To implement the proxy functionality, all traffic on the wire needs to be processed by the proxy. Therefore the Ethernet devices have to be set into "promiscuous mode" as described in 3.2. Basically the data flow is very similar to an implementation with libPCap (3.3.1), but there are no restrictions imposed by a predefined library, like the limited control over the buffer management. On the other hand, everything such as protocol classification and buffering has to be implemented by the proxy, which causes more processing in the proxy itself and therefore more workload during development.

A proper design would be multi-threaded and provide at least two threads: one capture thread for each Ethernet device. This prevents polling each device alternately. Polling is a type of busy waiting, which should not be used in software. With multiple threads, the access to shared information like the state or the presence of a TCP flow has to be managed and protected, therefore mutexes are used. Multi-threading and mutexes have been implemented since the early prototypes with the very handy POSIX Threads library libpthread.

5 see products on http://www.linksys.com


Figure 3.3: Dataflow with RAW Sockets ((1), (2): all traffic is copied from kernel space with a RAW Socket directly into the buffer of the proxy in user space; (3): all traffic is sent out via a RAW Socket to the other Ethernet device; (4): for TCP packets only a pointer is passed to the retransmission buffer; (5): retransmissions are sent via a RAW Socket)

In Figure 3.3 you can see that all traffic takes the way (1), (2) from kernel space to user space directly into the buffer of the proxy. TCP and non-TCP traffic is directly sent out on way (3) via a RAW Socket to the other Ethernet device, to provide bridging and therefore transparency of the proxy.

If the proxy classifies the content of the Ethernet frame as a TCP packet, a pointer to the current part of the buffer is forwarded on way (4) to the retransmission buffer of the proxy, and further processing like TCP connection tracking is triggered. For every other type of data traffic, the part of the buffer, later (in 3.4.2) defined as a chunk, is freed again. If a retransmission of a buffered TCP packet is needed later, the proxy sends the packet from the retransmission buffer on way (5) via a RAW socket to the Ethernet device.

The main difference from the libPCap design is the optimized buffer management. Instead of copying all traffic to user space and copying TCP packets a second time from the libPCap buffer to the retransmission buffer, only a pointer is passed to the other modules of the proxy for further processing and perhaps a retransmission. The buffer chunk is reused for non-TCP packets and kept for TCP packets, which is also a small performance enhancement. Compared to the libPCap design, exactly one copy operation in user space is saved, but the fundamental problem stays: all the traffic has to be copied to user space and filtered, classified or modified there.
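On Linux, the RAW Socket capture described above can be realized with an AF_PACKET socket bound to one device. The following sketch is an illustration under assumed names and sizes, not the thesis code; promiscuous mode still has to be enabled separately, for example with the PACKET_ADD_MEMBERSHIP socket option.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a packet socket that delivers every Ethernet frame seen on the
 * given interface (ETH_P_ALL) to user space. */
static int open_capture_socket(const char *ifname)
{
    struct sockaddr_ll addr;
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex  = if_nametoindex(ifname);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    unsigned char frame[2048];
    int fd = open_capture_socket("eth0");
    if (fd < 0)
        return 1;

    ssize_t n = recv(fd, frame, sizeof(frame), 0);  /* one Ethernet frame */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}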


3.3.3 Netfilter Queue

Before the advantages of the Netfilter Queue design can be shown, some information on the Linux Kernel Ethernet Bridge has to be given. This is important because the previous designs had to implement its functionality by forwarding the traffic from one device to another with RAW sockets.

3.3.4 Kernel Ethernet Bridge

Ethernet bridging aggregates two or more Ethernet devices into one logical device. Bridged devices are automatically put into "promiscuous mode" by the kernel. As we already know, this mode of operation tells the Ethernet device to forward all traffic to the device driver, not only the traffic which is destined to its own MAC address. All traffic which is not destined to the host of the proxy is forwarded to the device with the corresponding part of the Ethernet subnet. The bridge learns where to forward an Ethernet frame from the Source MAC Address field (see Figure 3.4) in the Ethernet header.

Figure 3.4: Ethernet Type II Frame format (MAC header with destination MAC address, source MAC address and EtherType, 14 bytes; payload/data, 46 to 1500 bytes; CRC checksum, 4 bytes; total frame size 64 to 1518 bytes)

All unknown MAC addresses from the Source MAC Address field are stored in a hash table together with some additional information like a timestamp and the source device. This builds a database of MAC addresses linked with the Ethernet device over which the host can be reached. The Ethernet bridge needs to process two cases:

If a MAC address is unknown, the Ethernet frame is forwarded to all other bridged devices. This will normally produce a response from the destination host. With such a response, the bridge is also able to detect the corresponding Ethernet device for the previously unknown destination host. In the response, source and destination are exchanged compared to the first Ethernet frame. The previous destination MAC address can now be learned from the source MAC address of the response. Using this logic, broadcasting to all other devices is needed only once.


If a MAC address is already stored in the hash table, the bridge knows where to forward the Ethernet frame.

3.3.4.1 Netfilter

The Netfilter Queue interface 6 is a part of the Netfilter framework. It is a user space library providing an API to packets that have been queued by the kernel packet filter. On the transport or network layer it can issue verdicts and/or reinject altered packets back into kernel space.

Using the well-known iptables 7 tool from the Netfilter framework, special filter rules can be set. These filter rules decide which packets are passed to user space. The possible ruleset of Netfilter includes every standard type of IPv4 or IPv6 traffic. For the proxy we only need to install one rule, which matches all TCP traffic, since TCP is the only relevant data for the proxy. Protocol classification is already applied in kernel space by the Netfilter framework. Traffic other than TCP stays in kernel space and is not transferred to the proxy in user space, so the memory reservation and the copy operation to user space for non-TCP traffic are saved. The task of pulling network traffic from one Ethernet device to the other, which makes the proxy transparent, does not have to be handled by the proxy in this design. Due to the fact that filtering and protocol classification can be applied in kernel space by the Netfilter framework, but controlled from user space, it is possible to use the built-in Kernel Ethernet Bridge which was described at the beginning of this section.

Netfilter makes no distinction between a physical network device and a logical bridge device. Rules can be set and applied to both of them. The two needed Ethernet devices are aggregated by the Kernel Ethernet Bridge into one bridge device. For this reason it is also enough to have only one thread for capturing, because there is only one device to capture from. This makes most of the thread synchronization in the whole implementation a lot easier and gains a bit more performance, because it saves some overhead. Information about which is the incoming and which is the outgoing physical device of a packet is also provided by Netfilter Queue. This information is needed for retransmission of the packet on the appropriate device.

6 http://www.netfilter.org/projects/libnetfilter_queue/index.html

7 http://www.netfilter.org/projects/iptables/index.html


Figure 3.5: Dataflow with Netfilter Queue ((1): the Kernel Ethernet Bridge hands the frame to the logical bridge device; (2): the frame is passed to the Netfilter framework; (3): Netfilter grants the forwarding to the other device; (4): TCP packets are copied to the proxy in user space via Netfilter Queue; (5): a retransmission, if needed, is sent via a RAW Socket to the appropriate Ethernet device)

For all traffic, the bridge decides whether it needs to be forwarded to the other Ethernet device. If the decision is positive, the frame takes way (1) in Figure 3.5 and is handed over to the logical Ethernet device which represents the bridge. From here the frame is passed (2) to the Netfilter framework. Netfilter issues verdicts based on the installed filter rules and grants (3) the forwarding to the other device. In the case of TCP traffic, the packet is copied (4) to the proxy in user space. Netfilter asks the user space application, the proxy, whether the packet should be forwarded or modified; normally the answer is yes for forwarding. Retransmission, if needed, is done with a RAW Socket (5) on the corresponding Ethernet device. This design saves a few more copy operations between kernel and user space than the other designs.
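A minimal sketch of driving libnetfilter_queue from user space is shown below; the queue number and the iptables rule in the comment are assumptions, and error handling is omitted, so this is an illustration of the interface rather than the thesis implementation.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Verdict callback: every queued TCP packet arrives here in user space.
 * A proxy would copy it into its retransmission buffer before accepting;
 * NF_ACCEPT reinjects the (possibly modified) packet into the kernel. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    printf("queued packet id=%u, %d bytes\n", id, len);
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    /* Assumes a rule such as
     *   iptables -A FORWARD -p tcp -j NFQUEUE --queue-num 0
     * so that only TCP traffic crossing the bridge reaches queue 0. */
    struct nfq_handle   *h  = nfq_open();
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    char buf[4096];
    int n;

    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);    /* copy full packets */
    int fd = nfq_fd(h);
    while ((n = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, n);               /* dispatches to cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}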

3.3.5 Conclusion of the Comparison

During the development of the proxy all three methods for capturing were implemented, in the order they were presented here. It is a bit like an evolutionary design with the goals of fixing issues and shortening/enhancing the way from the wire into the proxy.

The capturing methods libPCap (3.3.1) and RAW Sockets (3.3.2) are very similar, except for the more effective buffer management of the RAW Sockets design.


Technically, both designs use RAW Sockets for capturing, but libPCap has an abstraction layer above them to support different platforms and to make the handling easier. The libPCap design lags behind in that it needs a second buffer in user space for the TCP traffic. Copying the incoming TCP traffic to this second buffer is saved in the plain RAW Socket design, because the traffic is captured directly into the retransmission buffer of the proxy. Both designs do the protocol classification of the traffic completely in user space, therefore all the traffic has to go the way down from kernel to user space and back to kernel space. It goes back to kernel space because the traffic is forwarded to the other Ethernet device to implement bridging. Only a copy of the TCP traffic stays in user space for retransmission; all other traffic is dropped after it was forwarded.

The Netfilter design improves on this by doing the classification with the help of the kernel in kernel space. Only TCP traffic has to go the way down to user space and is buffered there, and only in the case of retransmission or modification does it have to go the way back up to kernel space. It also removes some complexity, because there is only one capture thread left for exactly one Ethernet device, the logical bridge device. Overhead for thread safety and synchronization is also saved.

The following Table 3.1 is a summary of the needed copy and memory operations. The k stands for kernel space and u for user space; "buffer[u]" denotes the retransmission buffer of the proxy in user space.

Design            Description                                        Copy operations   Memory allocations
libPCap           k -> u -> k  or  k -> u -> buffer[u] -> k          2 or 3            3 or 4
RAW Sockets       k -> buffer[u] -> k                                2                 3
Netfilter Queue   k -(pointer)-> k  or
                  (k -(pointer)-> k) and (k -> buffer[u])            0 or 1            1 or 2

Table 3.1: Comparison by counting copy and memory operations

Finally, it can be concluded that the Netfilter Queue design is the best choice with respect to performance. It saves many data transfers to user space, and this is very important for higher bit rates. The decision is based on reading the source code of the Linux kernel, the Netfilter framework and libPCap, and on implementing each design during the development of the proxy.


3.4 Module Design of the Proxy Implementation

3.4.1 Operating System requirements

For actually setting up the proxy, the Linux kernel needs to have "802.1d Ethernet Bridging" support enabled, and the Netfilter framework for IPv4 must be enabled. The Kernel Ethernet Bridge has to be set up and the filter rule which applies to TCP traffic has to be installed before any packet can be captured.

To control the "802.1d Ethernet Bridging" extension in the kernel, the "bridge utilities" 8 are needed. With the following shell commands the needed bridge (br0) starts to forward Ethernet frames from eth0 to eth1 and vice versa.

brctl addbr br0         # create bridge
brctl addif br0 eth0    # add eth0 to bridge
brctl addif br0 eth1    # add eth1 to bridge
ifconfig br0 up         # start the device

At this point the traffic can pass the proxy host fully transparently, but without any filtering or changing of the traffic. The user space proxy application must be started with the following command.

# <bridge device> <device1> <device2>
./tcp_proxy br0 eth0 eth1

Any further initialization is done during the startup of the application, which is described in the following sections for each module separately.

3.4.2 Module: Buffer Manager

The proxy implementation has to buffer every TCP packet until it can be assumed that it has reached its destination host. Therefore memory in user space has to be allocated. Normally this is done with a void *malloc(size_t size) system call, which returns a pointer to the allocated memory. After a packet has reached its destination, the memory could be deallocated with the void free(void *s) system call.

8 http://www.linux-foundation.org/en/Net:Bridge


But each system call gives control back to the operating system during the runtime of the application until the system call returns. This includes a context switch from the user space application to the kernel. Each context switch operation takes time, because registers and states are saved. If allocation and deallocation is done for each and every TCP packet separately, this could lead to a performance problem because of the high number of context switches. There would be two context switches for each buffered packet.

The Buffer Manager avoids this issue by allocating one large memory block at initialization time. The large block is divided into small pieces, each of which is later used to store a TCP packet. In addition, a data structure specially designed to be exchanged between the different modules of the implementation is used. Such a data structure is called a chunk from now on.

A chunk carries all important information about a packet: the packet itself, as a pointer to the right piece of the large buffer memory block, plus some additional management information. Normally a memory block is retrieved from the operating system and returned once the data stored in it is no longer needed. But if all packets are stored in one large memory block, this block cannot be returned to the operating system during runtime, because all the data would be lost, not only the data of the one packet that is no longer needed. Therefore the memory block is only returned when the application terminates.

The Buffer Manager has to keep track of which chunks are currently in use. A linked list which stores pointers to all currently unused chunks is used and is accessed like a FIFO (First In, First Out) list.

Figure 3.6: Logical structure: Linked list used as a FIFO

During initialization of the Buffer Manager all chunks are added to the FIFO list.

If another module needs a free chunk to store data, it asks the Buffer Manager. As shown in Figure 3.6, the Buffer Manager always removes the first element from the list and hands the pointer to that chunk to the requesting module. A returned chunk is appended at the end of the list.

29

3

Proxy Implementation Design

For retrieving a free chunk from the list, the implementation simply follows and backs up the FIRST pointer (see Figure 3.7 for the pointer names) to "Chunk 00". At the end of the retrieve operation, the backed-up pointer to "Chunk 00" is passed to the requesting module. "Chunk 01" becomes the new first element of the list; therefore the value of the NEXT pointer from "Chunk 00" to "Chunk 01" is copied into the FIRST pointer.


Figure 3.7: Implementation: Linked list at initial state

After the retrieval of "Chunk 00" the linked list looks like Figure 3.8 and a pointer to "Chunk 00" can be passed to the requesting module.


Figure 3.8: Implementation: Linked list after retrieval of a chunk

For returning "Chunk 00" back to the list, the implementation simply follows the LAST pointer to "Chunk n". The NEXT pointer of "Chunk n" is adjusted to point to "Chunk 00". Finally, after adjusting the LAST pointer to "Chunk 00", the return operation is finished. Figure 3.9 shows the order of list elements and the pointer adjustments after retrieving and returning "Chunk 00".

As a small summary: simply by using only the FIRST pointer for retrieving an element and only the LAST pointer for returning an element, no mutex is needed to protect this list against asynchronous manipulation. This prevents blocking another thread if two or more threads want to retrieve a chunk at the same time. Having a LAST pointer also speeds up returning a chunk considerably; without it, the whole list would have to be iterated to find the last element before appending, whereas with the LAST pointer the element can be appended directly.



Figure 3.9: Implementation: Linked list after returning the chunk

Simultaneous retrieving and returning is possible without using mutexes for thread synchronization, as long as there is more than one element in the list. This is an improvement if some processing of the chunks happens in parallel to the capturing of packets, which is the case for this proxy implementation.
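The following is a minimal sketch of such a chunk FIFO in C. The names (struct chunk_fifo, fifo_retrieve, fifo_return) are chosen here for illustration and are not the actual identifiers of the implementation; the real code keeps the chunks inside the large pre-allocated block and performs more bookkeeping.

/* Illustrative sketch of the unused-chunk FIFO described above.
 * One thread retrieves via FIRST only, another returns via LAST only,
 * so no mutex is needed as long as more than one element remains. */
struct chunk {
    struct chunk *next;     /* NEXT pointer of the linked list */
    char         *buffer;   /* piece of the large memory block */
};

struct chunk_fifo {
    struct chunk *first;    /* FIRST pointer, touched only when retrieving */
    struct chunk *last;     /* LAST pointer, touched only when returning */
};

/* Retrieve a free chunk: hand out the first element and advance FIRST. */
static struct chunk *fifo_retrieve(struct chunk_fifo *f)
{
    struct chunk *c = f->first;

    if (c == NULL || c->next == NULL)
        return NULL;                 /* keep at least one element in the list */
    f->first = c->next;
    return c;
}

/* Return a chunk: append it behind LAST, no iteration needed. */
static void fifo_return(struct chunk_fifo *f, struct chunk *c)
{
    c->next = NULL;
    f->last->next = c;
    f->last = c;
}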

3.4.3 Module: Netfilter Queue Interface

Capturing the TCP traffic is done in this module. For this purpose a callback is registered with the libnetfilter_queue library. The Linux kernel keeps a linked list of such callbacks, and with the libnetfilter_queue library it is possible to create an entry in this list. Whenever a packet matches a Netfilter filter rule with "QUEUE" as its action, these callbacks are called one after another in registration order.

The TCP traffic filter rule is installed during the initialization of the Netfilter Queue Interface module. This happens after the registration of the callback. If the proxy application is terminated for any reason, the rule is automatically deleted again. This is very important, because if a filter rule with "QUEUE" as action is installed and matches a packet, the Netfilter Framework asks a waiting application in user space whether the packet should be dropped. If no application is present to tell the framework that a packet has to be accepted, the packet is dropped.

The following box shows the rule which is installed by the proxy.

# command to install the TCP filter rule
iptables -A FORWARD -p tcp -j QUEUE

With this rule all forwarded ("-A FORWARD") TCP traffic ("-p tcp") is handled by the Netfilter Queue ("-j QUEUE"). The packets of interest are in the "FORWARD" chain of Netfilter, because the Kernel Ethernet Bridge is used.


The libnetfilter_queue library supports three copy modes to transport data from kernel space to user space.

NFQNL_COPY_NONE - Do not copy any data

NFQNL_COPY_META - Copy only packet meta data

NFQNL_COPY_PACKET - Copy entire packet

NFQNL_COPY_META would be enough to realize connection tracking in user space and would greatly reduce the amount of data that has to be copied to user space: only the IP and TCP headers are copied in this mode, which suffices to identify and track a TCP flow. For the proxy implementation, however, only the NFQNL_COPY_PACKET mode is the right choice. In this mode the whole packet is transferred to user space and can be buffered there by the proxy for retransmission.

To capture a single packet a blocking function of the libnetfilter_queue library must be called. The process of capturing is defined in eight steps:

1. Fetch a free buffer chunk from the Buffer Manager

2. Request the next packet with libnetfilter_queue (blocking)

3. Registered callback is triggered

4. Packet is passed to the TCP Connection Tracker

5. Packet is passed to the Connection Manager

6. Return to callback

7. Return from “Request the next packet”

8. Go to step 1.
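The following is a condensed sketch of such a capture loop with libnetfilter_queue, assuming queue number 0 and omitting error handling, the iptables rule installation and the proxy's chunk handling; exact function signatures may differ slightly between library versions.

#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called for every packet that matched the "-j QUEUE" rule. */
static int capture_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                      struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    unsigned char *payload;
    uint32_t id = ntohl(ph->packet_id);             /* Netfilter Queue ID */

    (void)nfmsg; (void)data;
    if (nfq_get_payload(nfa, &payload) >= 0) {
        /* here: copy the packet into a chunk, set the ip/tcp header
         * pointers and hand the chunk to the TCP Connection Tracker */
    }
    /* forward the (unmodified) packet */
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    char buf[65536];
    struct nfq_handle   *h  = nfq_open();
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &capture_cb, NULL);
    int rv;

    /* copy the whole packet, not only the meta data */
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

    for (;;) {                                      /* blocking capture loop */
        rv = recv(nfq_fd(h), buf, sizeof(buf), 0);
        if (rv >= 0)
            nfq_handle_packet(h, buf, rv);          /* triggers capture_cb() */
    }
}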

Step three not only retrieves the TCP packet from kernel space and stores it in a buffer; its length and other information needed for further processing is gathered and stored as well. Pointers to the IP and TCP headers are calculated and stored in the chunk data structure as ip_header and tcp_header, which is shown in the following box "Chunk data structure".


Chunk data structure

struct buf_man_cap_buf_chunc
{
    struct list_head     list;         /* linked list management */
    char                *buffer;       /* pointer to byte array with IP packet */
    unsigned int         length;       /* length of whole packet */
    struct device_info   out_device;   /* output device */
    struct ip           *ip_header;    /* pointer to IP header */
    struct tcphdr       *tcp_header;   /* pointer to TCP header */
    unsigned int         tcp_seq;      /* TCP sequence number */
    int                  nfq_id;       /* Netfilter Queue ID */
    enum chunc_state     state;
    struct timeval       timestamp;    /* general timestamp, set at receive AND sent
                                          (tv_sec = seconds, tv_usec = microseconds) */
};

This is done in preparation especially for the TCP Connection Tracker module, but also for all other modules that need direct access to these headers. The Netfilter Queue Interface module is the first module in the processing chain of the proxy implementation, therefore it makes sense to set these pointers here.

Additionally the timestamp is set to the current time, i.e. the capture time with a resolution of microseconds. This time is needed to calculate the Round Trip Time (RTT) of a packet; the RTT itself is calculated later in the Connection Manager.

To make retransmission possible, the outgoing physical device of a packet must be known. It is retrieved from Netfilter and stored in out_device of the chunk structure. Referencing a packet while using the Netfilter Queue interface is done with a Netfilter ID. Later, in the TCP Connection Tracker module or in the Connection Manager module, it is decided whether a packet should be dropped, forwarded or modified, and for that decision the Netfilter ID is needed to reference a specific packet. This is achieved by storing the Netfilter ID in nfq_id and by providing three functions for more convenient handling of packets. Each function takes a pointer to a chunk, which carries the packet, as parameter. The


Netfilter ID is automatically taken from the chunk data structure by these functions.

They are called:

int netfilter_signal_accept(struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_accept_but_modified(struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_drop(struct buf_man_cap_buf_chunc *chunc);

The names are largely self-explanatory, but a few words about them: netfilter_signal_accept and netfilter_signal_drop simply signal Netfilter to forward or drop the packet. netfilter_signal_accept_but_modified expects the modified packet in the chunk and passes it to kernel space, from where it is then forwarded.

When a packet arrives, the values of source port, destination port, SEQ number and ACK number are stored in Network Byte Order. The sequence number is looked up by several modules of the implementation, but for comparing it to another value it has to be in Host Byte Order. On x86 hosts the Host Byte Order differs from the Network Byte Order: the byte order is reversed. Reversing the byte order every time the value is compared is wasteful if this happens more than once. To improve this a bit, the sequence number is converted once by the Netfilter Queue Interface module and stored in tcp_seq of the chunk data structure. After all important information has been gathered and stored in the chunk data structure, the chunk is handed over to the TCP Connection Tracker module.
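As a small illustration of this conversion (the field is called seq or th_seq depending on which style of struct tcphdr the system headers expose; the snippet below assumes the Linux-style naming):

#include <arpa/inet.h>      /* ntohl() */
#include <netinet/tcp.h>    /* struct tcphdr */

/* Convert the sequence number to Host Byte Order once, so that all later
 * comparisons can work directly on host-order integers. */
static unsigned int host_order_seq(const struct tcphdr *tcp_header)
{
    return ntohl(tcp_header->seq);
}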

3.4.4 Module: TCP Connection Tracker

The TCP Connection Tracker module identifies a TCP flow and looks up the corresponding management data structure. If the flow is unknown to the proxy, a new management data structure is created and initialized. Tracking of TCP flows is stateful, which means a state is maintained per flow and only transitions conforming to the TCP standard are allowed. Packets that violate a stateful transition are ignored. Only new flows are picked up for tracking, during their connection establishment. Picking up an already established flow would be possible with the implemented state machine, but is not yet implemented by the rest of the proxy.


After identification of a flow, the newly arrived packet is added to a per-flow cache. The packet and the corresponding management data structure are handed over to the Connection Manager module. Hashing is used for faster lookup and identification of a flow. The implementation of the primary hash function was taken from the Linux kernel. It is called jhash 9 and was developed by Bob Jenkins; it is fast and mixes its input well. It was also published in the well-known Dr. Dobb's 10 computer magazine.

During initialization of the module, an array which is used as hash table is created.

Elements of the hash table are called buckets; each bucket is a linked list used for storing pointers to the management structures of TCP flows.

A primary and a secondary hash function are used to distribute data in the hash table.

The primary hash function is jhash and the secondary function is a modulo operation with the size of the hash table as divisor. A prime number is chosen for the size of the hash table. Using a prime number for the hash table size together with a modulo operation as secondary hash function is a good idea, because it minimizes clustering in the hash table (see 3.4.4.1).

The TCP State Machine is imported from the Netfilter Framework and described further in section 3.4.6.1. Netfilter implements a TCP state machine to provide stateful packet inspection for the Linux kernel packet filter. The reason for taking or adapting source code from other implementations is that these implementations have proven their value and many eyes have reviewed the code. It is not wise to reinvent the wheel each time and probably repeat the same mistakes other people have already made in the past.

3.4.4.1 Prime Numbers for the Secondary Hash Function

Normally we tend to use a value of 2^n as the size of an array, because we like and know these numbers: a programmer has a good feeling for such values and can compare them with the sizes of memory or hard disks of a PC. Let us call the size of the table S and the result of the primary hash function H. The secondary hash function is then (H mod S). What makes (H mod S) a hash function that distributes well?

Let the size S be divisible by 2; note that 2^n matches this specification. Then whenever H is divisible by 2, (H mod S) will also be divisible by 2, and whenever H is not divisible by 2, (H mod S) will not be either.

9 http://burtleburtle.net/bob/hash/

10 http://www.ddj.com/

35

3

Proxy Implementation Design

This means that, by applying the secondary hash function, even numbers hash to even indices and odd numbers hash to odd indices.

If S were also divisible by 3, then multiples of 3 would hash to multiples of 3 and non-multiples of 3 would hash to non-multiples of 3. We would expect half the numbers to be even and the other half to be odd, but unfortunately this is unlikely, because a sample set tends to be biased, and the smaller it is, the more so. As a result the secondary hash perpetuates this bias instead of reducing it by mixing the values. For example, with S = 8 the even hash values 4, 10, 16 and 22 map to the indices 4, 2, 0 and 6, all of them even, so half of the buckets stay empty; with the prime S = 7 the same values map to 4, 3, 2 and 1.

Therefore it is in general better to use a prime number as the size of a hash table, because a prime has no divisors other than 1 and itself.

3.4.4.2 Identify a TCP flow

A flow or connection is identified by a tuple of the source IP address (srcIP), the destination IP address (dstIP) and, in particular, the source port (srcPort) and destination port (dstPort) numbers:

flow tuple = (srcIP, dstIP, srcPort, dstPort)

For IP version 4 (IPv4), srcIP and dstIP are 32-bit values; srcPort and dstPort are 16-bit values.

Sample for an IP address:

IP address (4 bytes)   binary as octets                       unsigned 32 bit integer
130.94.122.195         10000010 01011110 01111010 11000011    2187229891

Normal PCs mostly work in x86 mode, which implies binary compatibility with the 32-bit instruction set: each machine instruction can take a 32-bit operand. Working with 32-bit operands is therefore the most effective use of this hardware, so the proxy implementation treats IP addresses and TCP ports as unsigned 32-bit values. This makes, for example, comparisons more efficient than comparing four bytes separately (as in the dotted notation).

The proxy sees packets of one flow in both directions: from the original sender to the original receiver, and the responses. For the response direction, srcIP and dstIP are swapped, and so are srcPort and dstPort.


This is a problem, because all packets of the original direction would produce a different hash value than those of the reply direction. Two hash values would mean searching two hash buckets for the corresponding management data structure. Maintaining two management structures, one for each direction of a flow, would also make it more difficult to keep track of which packets are already acknowledged and can be deleted from the proxy buffer.

Feeding the values into the jhash function in a special way solves this issue. The jhash function, which is used as the primary hash function, is designed to take one, two or three unsigned 32-bit values as arguments; for the proxy implementation the three-argument version is chosen. srcIP and dstIP are the first two arguments, 32 bits each, and the one of srcIP and dstIP that has the higher value as an unsigned 32-bit integer is always passed as the first argument. The third 32-bit value is calculated by adding the two 16-bit values srcPort and dstPort. Adding two 16-bit values yields at most a 17-bit result, which easily fits into the 32-bit third argument of jhash.

Applying this order to the arguments is the first half of the trick, because it always produces the same input arguments for jhash and therefore the same output for packets of both directions. The second half is summing the two TCP ports: since addition is commutative, adding the two 16-bit values srcPort and dstPort always produces the same result, regardless of the direction of the packet.

Taking only srcIP and dstIP as input for the primary hash function would be sufficient. Including srcPort and dstPort as well gives a better distribution over the hash table, especially if one IP host has many simultaneous connections on different TCP ports, and a better distribution results in shorter lookup times. The following sample C code fragment shows how easily the primary and secondary hash function can be implemented. SIZE represents the hash table size and HASH_RND a constant which jhash requires for better mixing.

port_val = srcPort + dstPort;

if ((unsigned int) srcIP > (unsigned int) dstIP)
{
    return jhash_3words(srcIP, dstIP, port_val, HASH_RND) % SIZE;
}
else
{
    return jhash_3words(dstIP, srcIP, port_val, HASH_RND) % SIZE;
}


The TCP Connection Tracker module mainly executes the following steps:

1. Get chunk from Netfilter Queue Interface module

2. Generate hash value from packet headers (IP and TCP, as described above)

3. Search hash bucket identified by hash value

4. Identify (or create a new) management data structure in the hash bucket

5. Add chunk to the per-flow retransmission buffer

6. Pass chunk and management data structure to Connection Manager

7. Go to 1.

All steps up to step four are covered by the previous parts of this section. Step four has two different behaviors.

For known TCP flows the module proceeds directly to step five. Packets from unknown connections are only accepted if the packet is the first packet of a 3-way connection handshake (see Figure 3.10); new management data structures are created and initialized only for these packets. In the figure this packet is marked by the arrow from "Client" to "Server" labeled "syn seq=x". Ignoring other packets is not an implementation fault but a design decision, because accepting them would mean picking up an already established flow and not only new ones, in which case some information could only be guessed. An example is which TCP options are supported by the communicating hosts. This information is very important for TCP Snoop, which acts differently if the TCP SACK option is not supported by the hosts. The implementation and the implemented TCP state machine are prepared for picking up established connections; this functionality could be added in future work but is not necessary for normal operation.


Figure 3.10: TCP Handshake

A management data structure carries one sub data structure for each direction of a flow, i.e. two per flow, with the following information.

38

3

Proxy Implementation Design

The sub data structure carries the following information:

Count of retransmitted packets (for statistics)

Count of duplicate ACKs (to detect lost packets)

First sequence number seen for this direction (to detect wrap arounds)

Last sequence number seen for this direction (to detect duplicate packets)

Last acknowledge number seen for this direction (to detect new ACKs)

Last “end” (Last sequence number + length of TCP data)

Last TCP window advertisement seen for this direction

RTT (for retransmission timers)

TCP Window scale factor

Supported TCP options

These sub data structures must be initialized for each new connection. A sketch of such a per-direction structure is shown below.
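The field names in the following sketch are chosen for illustration and do not necessarily match the identifiers used in the actual source code.

#include <sys/time.h>

/* Per-direction tracking data; two of these make up the management
 * structure of one flow.  Names are illustrative. */
struct flow_direction {
    unsigned int   retransmit_count;   /* retransmitted packets (statistics) */
    unsigned int   dup_ack_count;      /* duplicate ACKs (lost packet detection) */
    unsigned int   first_seq;          /* first SEQ seen (wrap-around detection) */
    unsigned int   last_seq;           /* last SEQ seen (duplicate detection) */
    unsigned int   last_ack;           /* last ACK seen (new ACK detection) */
    unsigned int   last_end;           /* last SEQ + length of TCP data */
    unsigned short last_window;        /* last advertised TCP window */
    unsigned char  window_scale;       /* TCP window scale factor */
    unsigned char  sack_ok;            /* supported TCP options (here: SACK) */
    struct timeval rtt;                /* RTT (for retransmission timers) */
};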

After that, or if the connection is already known, the TCP Connection Tracker continues with step five (see page 38), which is described in the next section 3.4.4.3. Step six forwards the chunk and the management data structure to the Connection Manager. The Connection Manager completes the stateful tracking with the raw information gathered by the TCP Connection Tracker; its processing is described in section 3.4.6.

3.4.4.3 Retransmission Buffer

Most of the buffered packets are never retransmitted; only a few are. This calls for a quick lookup and release from the buffer once a packet has been acknowledged by the receiver. As already described, packets are only exchanged between functions and modules of the implementation by passing pointers to chunks; the chunks encapsulate the pointer to the actual packet together with some additional information about it. An intuitive approach to look up and free all acknowledged packets would be to iterate a list


with all currently buffered packets. But looking up packets of other connections is very inefficient, therefore the proxy implements a per-flow cache. It consists of two linked lists that store pointers to chunks, ordered ascending by the TCP sequence number of the packets, one list for each communication direction. Adding a packet to the per-connection cache takes nearly the same effort as adding it to a global list of all currently buffered packets: the communication direction and the corresponding management structure are already known at step six (see page 38), so it is just the effort of adding an element to a linked list, except for keeping the list ordered. The ordinary case, however, is that packets carry rising sequence numbers, as defined by the TCP standard, so they are usually appended at the end of the already sorted list. Only delayed packets have to be inserted at the correct position and cause some extra work. The ascending order makes freeing acknowledged packets easy: the corresponding list is iterated and packets are freed from the beginning until a packet with a sequence number greater than or equal to the acknowledgment number is reached. In summary, packets are grouped by flow, additionally by direction, and ordered by sequence number. This allows an efficient lookup, which leads to good performance.
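A simplified sketch of releasing acknowledged chunks from such a SEQ-ordered list is shown below; buffer_manager_return() stands in for handing the chunk back to the Buffer Manager, and sequence number wrap-around is ignored for brevity.

struct buffered_chunk {
    struct buffered_chunk *next;
    unsigned int           tcp_seq;    /* sequence number in host byte order */
};

/* Provided by the Buffer Manager (illustrative name). */
extern void buffer_manager_return(struct buffered_chunk *c);

/* Free all chunks acknowledged by "ack" and return the new list head.
 * A real implementation must compare sequence numbers modulo 2^32.   */
static struct buffered_chunk *
release_acknowledged(struct buffered_chunk *head, unsigned int ack)
{
    while (head != NULL && head->tcp_seq < ack) {
        struct buffered_chunk *done = head;
        head = head->next;
        buffer_manager_return(done);
    }
    return head;
}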

3.4.5 Module: Timer Manager

The Timer Manager is mainly a helper module for the Connection Manager. During initialization of the module it installs an operating system callback: the operating system sends a timer signal at a given interval, which is caught by the signal handler of the proxy application and forwarded to the timer callback.

Intervals, types, count and organization of the timers are mainly based on ideas from the BSD TCP/IP stack [Stevens, chapter 25] and a TCP/IP stack for embedded systems. The BSD TCP/IP stack was used as a reference; it is very similar to the Linux TCP/IP stack and a good source of information about TCP timers. As in the BSD stack there is a fast and a slow timer. The fast one is triggered every 200 milliseconds and the slow one every 400 milliseconds; compared to the BSD stack, which uses an interval of 500 milliseconds for the slow timer, only this value differs. The reason for using 400 milliseconds in this implementation is a more efficient way to implement the timer handling in user space. An application can install only one timer signal, which would mean having only one interval. To solve this, the proxy installs a 200 millisecond timer and creates the 400 millisecond timer by doubling it: an alternating function which produces a 0, 1, 0, 1, ... sequence is used, and for every timer signal the output value is checked for being zero or non-zero.

40

3

Proxy Implementation Design

This function is very simple and defined as follows:

value = value XOR 1

The check for zero and the XOR can be translated by a compiler into very few machine instructions, which makes this very efficient: one XOR instruction and one conditional jump instruction for the value check. Efficiency matters here, because the timer handler runs every 200 milliseconds.
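The mechanism could be sketched as follows; run_fast_timer() and run_slow_timer() are placeholders for the actual timer processing, and the use of signal()/setitimer() is an assumption about how the single timer signal is installed.

#include <signal.h>
#include <sys/time.h>

extern void run_fast_timer(void);   /* 200 ms work (placeholder) */
extern void run_slow_timer(void);   /* 400 ms work (placeholder) */

static unsigned int toggle;         /* alternates 0, 1, 0, 1, ... */

/* Signal handler for the single 200 ms interval timer.  Every second
 * invocation also runs the slow (400 ms) timer work. */
static void timer_handler(int signo)
{
    (void)signo;
    run_fast_timer();
    toggle ^= 1;                    /* value = value XOR 1 */
    if (toggle == 0)
        run_slow_timer();
}

static void install_timer(void)
{
    struct itimerval iv = {
        .it_interval = { 0, 200000 },   /* 200 ms */
        .it_value    = { 0, 200000 },
    };

    signal(SIGALRM, timer_handler);
    setitimer(ITIMER_REAL, &iv, NULL);
}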

The Timer Manager offers a simple API to the other modules for creating a timer event. There are predefined timer actions from which the other modules can choose. Currently implemented actions are:

Retransmission, which retransmits a TCP packet.

Timewait, which is used during a special state of the implemented TCP state machine.

Timeout, which is used to detect timed out TCP flows (no data sent any more).

By utilizing a general purpose pointer in the data structure that stores the information about a timer event, all actions can use the same data structure. This makes it simple to extend the Timer Manager with more actions and makes the management of the different timer event types more efficient. Casting from the general purpose pointer to the actual data type of the current action is only done when the action has to be triggered, which simplifies and shortens the loop that checks whether an event action should fire.

The timer events are stored in an ordered linked list, one list for each of the two timers. Events are ordered by their timestamp, which defines when the timer action should be triggered. Keeping the events ordered raises the effort to create a new entry in the list, because the right place has to be found. But under the assumption that mostly retransmission timer events are created, the position is in most cases at or very near the end of the list. The main reason for this is the sequential processing of the TCP packets and the timeout calculation: the timeout is defined as the current time plus a multiple of the round trip time of a TCP packet, so the timeout values grow sequentially. Insertion into the list is done by starting at the end and iterating backwards until the correct position is found; under the previous assumption, iterating from the end is efficient. Searching for events that have to be triggered is done by iterating forward from the start of the list until the current time is larger than the timestamp of the currently checked event. This is also very efficient, because only the events that have to be triggered plus one are visited; future events do not have to be checked. A sketch of such an event structure and the backwards insertion follows.
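Names and list handling in this sketch are illustrative; the real implementation may differ in detail.

#include <sys/time.h>

enum timer_action { ACTION_RETRANSMIT, ACTION_TIMEWAIT, ACTION_TIMEOUT };

/* One timer event.  "payload" is the general purpose pointer which is
 * cast to the action-specific type only when the event fires. */
struct timer_event {
    struct timer_event *prev, *next;
    struct timeval      fire_at;      /* when the action should be triggered */
    enum timer_action   action;
    void               *payload;      /* e.g. a chunk for ACTION_RETRANSMIT */
};

struct timer_event_list {
    struct timer_event *head, *tail;
};

/* Insert a new event, searching backwards from the tail, because new
 * (retransmission) timeouts usually belong at or near the end. */
static void timer_event_insert(struct timer_event_list *l, struct timer_event *ev)
{
    struct timer_event *pos = l->tail;

    while (pos != NULL && timercmp(&pos->fire_at, &ev->fire_at, >))
        pos = pos->prev;

    ev->prev = pos;
    ev->next = pos ? pos->next : l->head;
    if (ev->next) ev->next->prev = ev; else l->tail = ev;
    if (pos)      pos->next      = ev; else l->head = ev;
}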


3.4.6 Module: Connection Manager

After the TCP Connection Tracker module has looked up the corresponding management structure for a packet, both are passed to this module, the Connection Manager. Its functionality can be described as a lightweight TCP stack, or as the TCP connection tracking part of a stateful firewall. As a reference for a standard TCP stack and how it works, the BSD implementation was used; it is described in the book "TCP/IP Illustrated, Volume 2: The Implementation" [Stevens]. Ideas for connection tracking are also taken from the Netfilter implementation and from a paper by Guido van Rooij [Rooij]. The following text describes connection tracking for TCP Snoop, not general purpose connection tracking.

The main tasks of this module are updating the state of a flow, calculating the RTT if possible, installing the retransmission timers, detecting gaps in the sequence numbers and detecting duplicate ACKs. During the development of the TCP proxy the question arose whether to keep a management structure for each communication direction or only one for the whole TCP flow. From the point of view of the TCP Connection Tracker module it makes sense to have two management structures, because that simplifies the hashing: management structures for different communication directions are hashed to different hash buckets. Caching could also be realized with the simplified hashing, because for one direction only the management structure of that direction has to be looked up to find a packet for retransmission.

But ignoring the communication direction of a packet leads to serious trouble. It is not sufficient to store each and every TCP packet, because the buffer of the TCP proxy would simply overflow; acknowledged packets have to be released from the buffer again. These ACKs are sent by the receiver of a packet, which means they arrive from the opposite direction. For a design with two management structures per flow, this means an additional lookup of the second management structure to check whether some previously received packet has been acknowledged and can be released from the buffer, because ACKs can only be found in the management structure of the opposite direction. In general, both management structures would have to be looked up for each new packet of a flow. Maintaining only one management structure with a slightly more costly hash function therefore makes sense, because one lookup should always be cheaper than two.

An additional argument for keeping only one management structure is that state transitions of the state machine for a specific flow can and must be triggered by traffic from both directions. The flow itself can have only one state, therefore this implementation maintains only one state in one structure, which corresponds to the TCP standard.


There is no need to store the state twice or to maintain two different states.

The processing of a single packet can be described as follows:

1. Calculate an index value from the TCP flags for the state machine

2. Determine the new state of the flow with the index value and the state machine

3. Check SEQ number of the current packet

4. Check if ACK is present

Check if it is a duplicate ACK

Calculate RTT if possible

Release acknowledged packets from cache

5. Update state and other values in the management structure

6. Install retransmission timer

7. Signal Netfilter to forward packet

8. Return control to capture module

3.4.6.1 Stateful tracking

First, realizing stateful connection tracking requires a representation of the state of a TCP flow. This is implemented as a numeric value which represents the current state of a finite state machine. The state machine used in this implementation is shown in Figure 3.11 on page 47. It is a simplified version to give an easier overview: arrows for state transitions with the same attributes but opposite communication direction are combined into one arrow.

Transitions to other states are triggered by flags in the TCP header (see Figure 2.1 as reference). Flags that can lead to transitions are Synchronize (SYN), Reset (RST), Acknowledge (ACK) and Finish (FIN). Every time a new packet arrives, its flags are processed and an index value of a possible


transition is calculated from them. The index value and the direction of communication are fed into the state machine to determine the new state of a flow; transitions depend on the communication direction and the flags of a packet. To determine the index value for a transition, the relevant flag or flag combination is associated with a numeric value. The numeric value of the first pattern that applies to the flags of the current packet is used as index value (a sketch of this mapping follows the list). The patterns are checked in this order:

1. RST flag present: sent if one of the communicating hosts wants to reset the state of the TCP flow.

2. SYN flag and ACK flag present: sent with the second packet of the 3-way handshake.

3. SYN flag only: sent with the first packet of the 3-way handshake.

4. FIN flag present: sent if one of the communicating hosts wants to close the TCP flow.

5. ACK flag present: normally present if the corresponding host wants to acknowledge a packet.
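A sketch of how this index could be computed is shown below; the flag constants are defined locally for clarity and the enum values are illustrative, not the actual index values used by the imported Netfilter state machine.

#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

enum tcp_index { IDX_RST, IDX_SYNACK, IDX_SYN, IDX_FIN, IDX_ACK, IDX_NONE };

/* Map the TCP flags of a packet to a state machine index; the first
 * matching pattern wins, checked in the order listed above. */
static enum tcp_index flags_to_index(unsigned char flags)
{
    if (flags & FLAG_RST)
        return IDX_RST;
    if ((flags & (FLAG_SYN | FLAG_ACK)) == (FLAG_SYN | FLAG_ACK))
        return IDX_SYNACK;
    if (flags & FLAG_SYN)
        return IDX_SYN;
    if (flags & FLAG_FIN)
        return IDX_FIN;
    if (flags & FLAG_ACK)
        return IDX_ACK;
    return IDX_NONE;
}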

Next, the SEQ number of the current packet is checked. This is done to detect gaps in the sequence numbers and out-of-order packets. To keep track of the next expected SEQ number, the proxy calculates for each packet the sum of the current SEQ number and the size of the packet payload and stores it in the management structure in a variable called "last_end". When the next packet arrives, last_end can be compared with its actual SEQ number. If both values are equal, everything went fine. If the SEQ number is greater than last_end, a packet was lost and a gap in the SEQ numbers is detected. For data traffic coming from the wireless part of the network, a retransmission with TCP SACK is triggered in this case: if the detected flow supports TCP SACK, a new packet with no payload but the corresponding SACK information is created and sent to the corresponding MH. It is not piggybacked on the next packet, in order to minimize the delay.
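The gap check itself boils down to a few lines; the sketch below uses a pointer to the stored last_end value and, for brevity, ignores sequence number wrap-around.

/* Returns 1 if a gap was detected, 0 otherwise, and updates *last_end
 * to the SEQ number expected from the next in-order packet. */
static int check_seq_gap(unsigned int *last_end,
                         unsigned int seq, unsigned int payload_len)
{
    int gap = (seq > *last_end);    /* a packet in between was lost */

    if (seq >= *last_end)           /* ignore duplicates / overlaps here */
        *last_end = seq + payload_len;
    return gap;
}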

If the ACK flag is set for the current packet, the ACK number has to be processed. This is needed to detect duplicate ACKs, to calculate the RTT and to release packets from the buffer once they have reached their destination. Duplicate ACKs can be detected very easily: the proxy only checks whether the current ACK number is equal to the last ACK number, which


is still stored in the management structure. The number of duplicate ACKs is counted and also stored in the management structure; when a successful ACK is detected, this counter is reset to zero. How the RTT is calculated is described in the next section 3.4.6.2. For releasing acknowledged packets from the buffer, the proxy iterates over the buffered packets of the opposite communication direction until the SEQ number of a buffered packet is greater than or equal to the ACK number of the current packet. This can be done very efficiently, because the linked list with the buffered chunks is sorted by the SEQ numbers of the packets; the whole list has to be iterated only if all buffered packets can be released.

With the current RTT value for the current communication direction, a retransmission timeout (RTO) is calculated using the simple formula RTO := t + 4 * RTT, with t being the current time. Finally, all gathered values and other information are stored in the management structure of the current flow; there are sub-structures for direction-dependent information such as ACK or SEQ numbers. A retransmission timer with the calculated RTO is installed, and via a function of the Netfilter Queue Interface (see page 34) Netfilter is signaled to forward the packet. After that the Ethernet bridge forwards the packet and it can reach its destination.

3.4.6.2 Round Trip Time calculation

The RTT is the time a packet needs to reach its destination plus the time needed for its acknowledgment. The time for an acknowledgment is also a packet transit time, because the acknowledgment is sent as a separate packet or along with another packet.

For calculating the RTT passively on the TCP proxy, every captured packet needs a timestamp. With a timestamp on each and every TCP packet passing through, both a packet and its acknowledgment get a timestamp. If the Connection Manager module detects an acknowledgment, it tries to release the acknowledged packet from the buffer. Shortly before releasing the packet, it looks up the timestamp of the original packet and the timestamp of the acknowledgment packet. The difference of the two timestamps is the RTT of a packet from the TCP proxy to a host.
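The difference of the two capture timestamps can be computed as sketched below (returning microseconds; the helper name is illustrative).

#include <sys/time.h>

/* RTT in microseconds, from the capture timestamp of the data packet
 * and the capture timestamp of its acknowledgment. */
static long rtt_usec(const struct timeval *sent, const struct timeval *acked)
{
    return (acked->tv_sec  - sent->tv_sec) * 1000000L
         + (acked->tv_usec - sent->tv_usec);
}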

RTT values are stored separately for each communication direction of a flow, because the bandwidth of the wireless link is normally lower than that of the wired link, which creates an asymmetry. For the faster link the RTT values are probably lower, therefore it makes sense to store two RTT values per flow.


From the perspective of one of the communicating hosts, the RTT would be the sum of the two separate values stored by the TCP proxy. The TCP proxy implementation uses a simple algorithm to determine the RTT. If a retransmission is in progress, the RTT calculation is paused to avoid difficulties: during a retransmission, for example, a delayed acknowledgment could be received and produce a very low RTT value just because the retransmitted packet has only just been sent. Every new RTT value is smoothed by 25% to be more resistant against fluctuations and measurement errors. This is done by the formula

RTT = (3 * RTT_old + RTT_new) / 4