
Indian Institute of Technology, Roorkee

Progress Report
High Performance Intrusion Detection System

Project Guide
Dr. Anjali Sardana, Asst. Professor, Electronics and Computer Science Department

Submitted by
Mohd Junaid Siddiqui 11536016

Abstract

The constant increase in link speeds and in the number of threats poses challenges to network intrusion detection systems (NIDS), which must cope with higher traffic throughput and perform ever more complex per-packet processing. It is becoming increasingly difficult to implement effective systems for preventing network attacks. There is a growing demand for network devices capable of examining the content of data packets in order to improve network security and provide application-specific services. High-speed packet content inspection and filtering devices rely on fast multi-pattern matching algorithms to detect predefined keywords or signatures in the packets. Pattern matching is a highly computationally intensive operation used in a plethora of applications, and, due to ever-increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. Most high performance systems that perform deep packet inspection implement simple string matching algorithms to match packets against a large but finite set of strings. However, there is growing interest in regular expression-based pattern matching, since regular expressions offer superior expressive power and flexibility; hence, specialized hardware-accelerated algorithms are being developed for line-speed packet processing. While several pattern matching algorithms have already been developed for such applications, the current concern is to find a solution that exploits the highly parallel computational capabilities of commodity graphics processing units (GPUs) for high-speed pattern matching. The parallelism of the GPU is used to inspect packet content concurrently; one such architecture is provided by NVIDIA CUDA.

Contents

1 Introduction
2 Issues in High Performance Computing Architecture
  2.1 GPU Architecture
  2.2 Pattern Matching
  2.3 Regular Expression and DFA
3 Optimizations
  3.1 Based on minimizing the communication between CPU and GPU
    3.1.1 Transferring Network Packets to the GPU
  3.2 Based on minimizing the computational cost
    3.2.1 Data storage in GPU
    3.2.2 Delayed Input DFAs (D2FAs)
4 Classification of HPCA for IDS
  4.1 Based on minimizing the communication cost incurred between CPU and GPU
  4.2 Based on minimizing the computation cost
5 Algorithms for minimizing computational cost over GPU
  5.1 Converting DFAs to D2FAs
6 Conclusion
References

1 Introduction

Intrusion detection is the act of detecting unwanted traffic on a network or a device. An IDS can be a piece of installed software or a physical appliance that monitors network traffic in order to detect unwanted activity and events such as illegal and malicious traffic, traffic that violates security policy, and traffic that violates acceptable use policies. Many IDS tools will also store a detected event in a log to be reviewed at a later date, or will combine events with other data to make decisions regarding policies or damage control. An IPS is a type of IDS that can prevent or stop unwanted traffic; the IPS usually logs such events and related information. An IDS can be classified into two categories: a Host-based Intrusion Detection System (HIDS) and a Network-based Intrusion Detection System (NIDS). A NIDS is one common type of IDS that analyzes network traffic at all layers of the Open Systems Interconnection (OSI) model and makes decisions about the purpose of the traffic, analyzing it for suspicious activity. Most NIDSs are easy to deploy on a network and can often view traffic from many systems at once. Host-based intrusion detection systems, on the other hand, analyze network traffic together with system-specific settings such as software calls, local security policy, local log audits, and more. A HIDS must be installed on each machine and requires configuration specific to that operating system and software. Both can employ different detection techniques, categorized as signature-based detection and anomaly-based detection. An IDS using signature-based detection relies on known traffic data to analyze potentially unwanted traffic. This type of detection is very fast and easy to configure; however, an attacker can slightly modify an attack to render it undetectable by a signature-based IDS. Still, signature-based detection, although limited in its detection capability, can be very accurate. An IDS that looks at network traffic and detects data that is incorrect, not valid, or generally abnormal is performing anomaly-based detection. This method is useful for detecting unwanted traffic that is not specifically known. For instance, an anomaly-based IDS will detect that an Internet Protocol (IP) packet is malformed; it does not detect that it is malformed in a specific way, but indicates that it is anomalous.

With ever-increasing storage capacity and link speeds, the amount of data that needs to be searched, analyzed, categorized, and filtered is growing rapidly. For instance, network monitoring applications, such as network intrusion detection systems and spam filters, need to scan the contents of a vast amount of network traffic against a large number of threat signatures. Signature-based NIDS have been widely deployed to protect networks from attacks; the pattern matching algorithm used to deeply inspect packet content dominates the performance of a NIDS and may become its bottleneck in high-speed network environments. Most high performance systems that perform deep packet inspection implement simple string matching algorithms to match packets against a large but finite set of strings. However, there is growing interest in the use of regular expression-based pattern matching, since regular expressions offer superior expressive power and flexibility. Deterministic finite automata (DFA) representations are typically used to implement regular expressions.

An important class of algorithms used for searching and filtering information relies on pattern matching. Pattern matching is one of the core operations used by applications such as traffic classification, intrusion detection systems, virus scanners, spam filters, and content monitoring filters. Unfortunately, this core and powerful operation has significant overheads in terms of both memory space and CPU cycles, as every byte of the input has to be processed and compared against a large set of patterns. A possible solution to the increased overhead introduced by pattern matching is the use of hardware platforms, although at a high and often prohibitive cost for many organizations. Specialized devices, such as ASICs and FPGAs, can be used to inspect an input data stream and offload the CPU. Both are very efficient and perform well; however, they are complex to program and modify, and they are usually tied to a specific implementation. The advent of commodity massively parallel architectures, such as modern graphics processors, is a compelling alternative option for inexpensively removing the burden of computationally intensive operations from the CPU. The data-parallel execution model of modern graphics processing units (GPUs) is a perfect fit for the implementation of high-performance pattern matching algorithms. A GPU-based pattern matching engine enables content scanning at multi-gigabit rates and allows real-time inspection of the large volume of data transferred over modern network links.

The original purpose of the graphics processor was computer graphics applications, such as 3D processing for games. The demands of 3D animation drove graphics processors to deliver real-time, smooth, and vivid rendering. As a result, the computational power of modern graphics processors has increased dramatically in recent years, even surpassing that of general-purpose processors in floating point computation. This computational power derives from the parallel computing ability of multiple stream processors, and it has also caught the eye of developers outside the game and graphics fields. The development of non-graphics applications has been under way for a while; such applications are called General Purpose computations on Graphics Processor Units (GPGPU). As GPUs become increasingly powerful and ubiquitous, researchers have begun exploring ways to tap their power for non-graphics or general-purpose applications. The main reason behind this evolution is that GPUs are specialized for computationally intensive and highly parallel operations - required for graphics rendering - and are therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. The release of software development kits (SDKs) from big vendors like NVIDIA and ATI has started a trend of using GPUs as a computational unit to offload the CPU.

Figure 1: CPU and GPU architecture

2 Issues in High Performance Computing Architecture

2.1 GPU Architecture
Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation, and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations - the same program executed on many data elements in parallel - with high arithmetic intensity, i.e., a high ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements with high arithmetic intensity, memory access latency can be hidden with calculations instead of big data caches. The GPU consists of a large number of shader processors and conceptually operates as a Single Instruction Multiple Data (SIMD) device. Modern GPUs have been at the leading edge of increasing chip-level parallelism for some time. Current NVIDIA GPUs are many-core processor chips, scaling from 8 to 240 cores. This degree of hardware parallelism reflects the fact that GPU architectures evolved to fit the needs of real-time computer graphics, a problem domain with tremendous inherent parallelism. Modern GPUs have thus evolved into massively parallel computational devices, containing hundreds of processing cores that can be used for general-purpose computing beyond graphics rendering.

The fundamental difference between CPUs and GPUs comes from how transistors are assigned to different tasks in the processor. A GPU devotes most of its die area to a large array of Arithmetic Logic Units (ALUs). In contrast, most CPU resources serve a large cache hierarchy and a control plane for sophisticated acceleration of a single thread. The architecture of modern GPUs [6] is based on a set of multiprocessors, each of which contains a set of stream processors operating on SIMD (Single Instruction Multiple Data) programs. For this reason, a GPU is ideal for parallel applications requiring high memory bandwidth to access different sets of data. Both NVIDIA [10] and AMD provide convenient programming libraries to use their GPUs as general purpose processors (GPGPU), capable of executing a very high number of threads in parallel. A unit of work issued by the host computer to the GPU is called a kernel. A typical GPU kernel execution takes the following four steps: (i) the DMA controller transfers input data from host memory to GPU memory; (ii) a host program instructs the GPU to launch the kernel; (iii) the GPU executes the threads in parallel; and (iv) the DMA controller transfers the result data back to host memory from device memory. A kernel is executed on the device by many different threads organized in thread blocks, and each multiprocessor executes one or more thread blocks. A fast shared memory is managed explicitly by the programmer within each thread block. The global, constant, and texture memory spaces can be read from or written to by the host, are persistent across kernel launches by the same application, and are optimized for different memory usages.
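As a concrete illustration, the following minimal CUDA sketch walks through these four steps using the standard runtime API. The kernel body, buffer names, and sizes are illustrative placeholders, not part of the original report.

    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    __global__ void scan_kernel(const unsigned char *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   /* placeholder per-element work */
    }

    int main(void) {
        const int N = 1 << 20;
        unsigned char *h_in = (unsigned char *)malloc(N);
        int *h_out = (int *)malloc(N * sizeof(int));
        unsigned char *d_in; int *d_out;
        memset(h_in, 'a', N);
        cudaMalloc((void **)&d_in, N);
        cudaMalloc((void **)&d_out, N * sizeof(int));
        /* (i) DMA transfer: host memory to GPU (device) memory */
        cudaMemcpy(d_in, h_in, N, cudaMemcpyHostToDevice);
        /* (ii) the host instructs the GPU to launch the kernel, and
           (iii) the GPU executes the threads in parallel */
        scan_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
        /* (iv) DMA transfer: results back from device to host memory */
        cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_in); cudaFree(d_out);
        free(h_in); free(h_out);
        return 0;
    }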

2.2 Pattern Matching

String searching and regular expression matching are two of the most common pattern matching operations [2]. In string searching, a set of fixed strings is searched for in a body of text. Regular expressions, on the other hand, offer significant advantages, providing flexibility and expressiveness in specifying the context of each match. In addition to matching strings of text, they offer wild-card characters, logical operators, repeating patterns, range constraints, and recursive forms. Thus, a single regular expression can cover a large number of individual string representations. Both string patterns and regular expressions can be matched efficiently by compiling the patterns into a Deterministic Finite Automaton (DFA). A sequence of n bytes can then be processed using O(n) operations irrespective of the number of patterns, which is very efficient in terms of speed. This is achieved because at any state, every possible input byte leads to at most one new state. To take advantage of the extreme thread-level parallelism of modern GPUs, the DFA-based matching process is parallelized by splitting the input data stream into different chunks. Each chunk is scanned independently by a different thread using the same automaton, which is stored in device memory. Although threads use the same automaton, each thread maintains its own state, eliminating any need for communication between them.
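For reference, a minimal sketch of the sequential DFA scan follows, one table lookup per input byte; the table layout and names are illustrative assumptions rather than code from [2]. The parallel GPU version is sketched at the end of Section 2.3.

    /* Host-side sketch of DFA scanning: one table lookup per input byte,
       O(n) regardless of how many patterns the automaton encodes.
       T[s][c] gives the next state; is_final flags accepting states. */
    int dfa_scan(const unsigned int (*T)[256], const unsigned char *is_final,
                 const unsigned char *buf, int n) {
        unsigned int s = 0;                /* start state q0 */
        for (int i = 0; i < n; i++) {
            s = T[s][buf[i]];              /* exactly one transition per byte */
            if (is_final[s])
                return 1;                  /* some pattern matched */
        }
        return 0;
    }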

Name           Reg Ex   Designation
Epsilon        ε        Denoting the set {ε}
Character      a        For some character a, denoting the set {a}
Concatenation  RS       Denoting the set { αβ | α in R and β in S },
                        e.g., {ab}{d,ef} = {abd, abef}
Alternation    R|S      Denoting the set union of R and S,
                        e.g., {ab}|{ab,d,ef} = {ab,d,ef}
Kleene star    R*       Denoting the smallest superset of R that contains ε and
                        is closed under string concatenation; this is the set of
                        all strings that can be made by concatenating zero or
                        more strings from R

Table 1: Regular expression operations.

2.3 Regular Expression and DFA

A regular expression is a very convenient form of representing a set of strings. Regular expressions are usually used to give a concise description of a set of patterns, without having to list all of them. For example, the expression (a|b)*aa represents the infinite set {aa, aaa, baa, ...}, which is the set of all strings over the characters a and b that end in aa. Formally, a regular expression contains at least one of the operations described in Table 1.

A deterministic finite automaton (DFA) represents a finite state machine that recognizes a regular expression. A finite automaton is represented by the 5-tuple (Σ, Q, T, q0, F), where Σ is the alphabet, Q is the set of states, T is the transition function, q0 is the initial state, and F is the set of final states. Given an input string I0 I1 I2 ... IN, a DFA processes the input as follows: at step 0, the DFA is in state s0 = q0; at each subsequent step i, the DFA transitions into state si = T(si-1, Ii). To avoid backtracking during the matching phase, each transition is unique for every state and character combination. A DFA accepts a string if, starting from the initial state and moving from state to state, it reaches a final state. The transition function can be represented by a two-dimensional table T, which defines the next state T[s, c] for a state s and a character c. For example, the regular expression (abc+)+ is recognized by the DFA shown in Figure 2. The automaton has four states; state 0 is the start state, and state 3 is the only final state.

Many existing tools that use regular expressions, such as grep(1), flex(1) and pcre(3), have support for converting regular expressions into DFAs. The most common approach is to first compile them into non-deterministic finite automata (NFAs), and then convert those into DFAs. Each regular expression can be converted into an NFA using Thompson's algorithm. The generated NFA is then converted to a DFA incrementally, using the subset construction algorithm. The basic idea of subset construction is to define a DFA in which each state is a set of states of the corresponding NFA. The resulting DFA achieves O(1) computational cost for each consumed character of the input during the matching phase.

Figure 2: The DFA state machine

Figure 3: State transition table

Each DFA is represented as a two-dimensional state table array that is mapped onto the memory space of the GPU. The dimensions of the array are equal to the number of states and the size of the alphabet, respectively. Each cell contains the next state to move to, as well as an indication of whether the state is a final state or not.
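A minimal CUDA sketch of this scheme follows. The flattened table layout, the fixed-size chunking, and the use of a flag bit to mark final states are illustrative assumptions; the engines in the cited work differ in detail.

    /* Sketch of DFA matching on the GPU: the state table is a
       num_states x 256 array in device memory; each thread scans its own
       chunk of the input with a private current state. FINAL_BIT marking
       accepting states inside each cell is an assumed encoding. */
    #define FINAL_BIT 0x80000000u

    __global__ void dfa_match(const unsigned int *state_table,
                              const unsigned char *input,
                              int chunk_size, int total_len,
                              int *match_flags) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int start = tid * chunk_size;
        if (start >= total_len)
            return;
        int end = min(start + chunk_size, total_len);
        unsigned int state = 0;                        /* start state q0 */
        for (int i = start; i < end; i++) {
            unsigned int cell = state_table[state * 256 + input[i]];
            if (cell & FINAL_BIT)
                match_flags[tid] = 1;                  /* reached a final state */
            state = cell & ~FINAL_BIT;                 /* move to the next state */
        }
        /* note: patterns crossing chunk boundaries need overlap handling,
           omitted here for brevity */
    }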

3 Optimizations

3.1 Based on minimizing the communication between CPU and GPU

3.1.1 Transferring Network Packets to the GPU

The first thing to consider is the transfer of the packets to the memory space of the GPU [2]. A major bottleneck for this operation is the extra overhead caused by the PCIe bus that interconnects the graphics card with the base system. The PCIe bus suffers many overheads, especially for small data transfers, although with a large buffer the per-byte transfer cost to the device is minimized. As a consequence, network packets are transferred to the memory space of the GPU in batches. A separate packet buffer is allocated to collect the incoming packets; whenever the buffer gets full, all packets are transferred to the GPU in one operation. The format of the packet buffer plays a significant role in the overall packet processing throughput. First, it affects the transfer overheads, as small data transfer units achieve reduced bandwidth due to PCIe and DMA overheads. Second, the packet buffer scheme affects the parallelization approach, i.e., the distribution of the network packets to the stream processors: the simpler the buffer format, the better the parallelization scheme. There are two different approaches for collecting packets. The first uses fixed buckets [2], as shown in Figure 4, for storing the network packets, and has previously been adopted in similar works. The second approach uses a more sophisticated, index-based [2] scheme. Instead of pre-allocating a different fixed-size bucket for each packet, all packets are stored back-to-back in a serial packet buffer. A separate index is maintained that keeps pointers to the corresponding offsets in the buffer, as shown in Figure 5. Each thread reads the corresponding packet offset independently, using its own thread number, without any lock or synchronization mechanism. To avoid an extra transaction over the PCIe bus, the index array is stored at the beginning of the packet buffer. The packet buffer and the indices are then transferred to the GPU at once, adding only a minor transfer cost, since the size of the index array is quite small relative to the size of the packet buffer. A sketch of this layout appears after Figure 5.

Figure 4: Packets are stored in different buckets

Figure 5: Packets are stored sequentially and indexed by a separate directory.
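A minimal CUDA sketch of the index-based scheme, assuming the layout just described: the offset index sits at the start of the buffer, followed by the packets back-to-back. The sentinel end offset and all names here are illustrative assumptions.

    /* Each thread locates its packet purely from its thread id; no locks or
       synchronization are needed because threads only read disjoint offsets. */
    __global__ void per_packet_scan(const unsigned char *buf, int num_packets) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= num_packets)
            return;
        /* the index array of packet offsets is stored at the start of the
           buffer, one entry per packet plus a sentinel end offset (assumed) */
        const int *index = (const int *)buf;
        int off = index[tid];
        int len = index[tid + 1] - off;     /* sentinel gives the last length */
        const unsigned char *pkt = buf + off;
        /* ... scan pkt[0 .. len-1] against the automaton (omitted) ... */
        (void)pkt; (void)len;
    }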

3.2 Based on minimizing the computational cost

3.2.1 Data storage in GPU

Data are stored in a GPU in the form of textures [5]. A texture is a 2D or 3D array in which each element contains the following four components: Red, Green, Blue and Alpha (RGBA). Data are transferred over a bus that connects the GPU to the CPU; a simple flow is: stream the data to the GPU, perform the computation, and read the data back to the CPU. The dimensions of a texture determine the number of data values that can be stored in it: a texture of width W and height H can store 4 * W * H values if all four RGBA components are used. The two major tasks of DFA matching, as described in the previous section, are reading the input data and fetching the next state from device memory. These memory transfers can take up to several hundreds of nanoseconds, depending on the stream conditions and congestion. In general, memory latencies can be hidden by running many threads in parallel: multiple threads can improve the utilization of the memory subsystem by overlapping data transfer with computation. To obtain the highest level of performance, several tests were performed to determine how the computational throughput is affected by the number of threads; in [2] the authors discuss how the memory subsystem is utilized when a large number of threads run in parallel. Moreover, they investigated storing the network packets and the DFA state table both in the global memory space and in the texture memory space of the graphics card. Texture memory can be accessed in a random fashion for reading, without the need to follow any coalescing rules. Furthermore, texture fetches are cached, increasing performance when read operations preserve locality. In addition, the texture cache is optimized for 2D spatial locality; to that end, both 1D and 2D textures were investigated. A programming limitation when dealing with 2D textures is that the maximum y-dimension is 65,536. Therefore, in order to map large state tables, the initial table is split into several smaller ones (each of which contains at most 64K states) that are aligned sequentially. To find the transitions of a given state at the matching phase, the state number is first divided by 65,536 to locate the sub-table in which it resides.
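A sketch of that lookup arithmetic follows, using the legacy CUDA texture-reference API of that era; the side-by-side arrangement of sub-tables along the x-axis is an assumption made here to keep the y-dimension under the limit.

    /* The state number selects a 64K-state sub-table (slice) and a row
       within it; the input character selects the column inside that slice. */
    texture<unsigned int, 2, cudaReadModeElementType> tex_state_table;

    __device__ unsigned int next_state(unsigned int state, unsigned char c) {
        unsigned int slice = state >> 16;     /* state / 65536: which sub-table */
        unsigned int row   = state & 0xFFFF;  /* state % 65536: row in sub-table */
        /* sub-tables laid out side by side along x (assumed arrangement) */
        return tex2D(tex_state_table, slice * 256 + c, row);
    }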

3.2.2 Delayed Input DFAs (D2FAs)

It is well known that for any regular expression set, there exists a DFA with the minimum number of states. The memory needed to represent a DFA is determined by the number of transitions from one state to another, or equivalently, the number of edges in the graph representation. For an ASCII alphabet, there can be up to 256 edges leaving each state, making the space requirements excessive. Table compression techniques can be applied to reduce the space when the number of distinct next-states from a given state is small. However, for the DFAs that arise in network applications, these methods are typically not very effective because, on average, there are more than 50 distinct next-states from the various states of the automaton. A modification to the standard DFA is available that can be represented much more compactly. It is based on a technique used in the Aho-Corasick string matching algorithm, extended and applied to DFAs obtained from regular expressions rather than simple string sets.


Figure 6: Data transfer rate between host and device (Gbit/s).

4 Classification of HPCA for IDS

4.1 Based on minimizing the communication cost incurred between CPU and GPU

Various algorithms are available for this optimization. The main aim is to reduce the number of transfers of packets - to be scanned - between the CPU and the GPU. The first solution is to transfer the packets in batches rather than transferring them individually. The experiments carried out by Giorgos Vasiliadis [2] show that the transfer rate increases with the buffer size; see Figure 6. As a consequence, network packets are transferred to the memory space of the GPU in batches: a separate packet buffer is allocated to collect the incoming packets, and whenever the buffer gets full, all packets are transferred to the GPU in one operation. The format of the packet buffer plays a significant role in the overall packet processing throughput. First, it affects the transfer overheads, as small data transfer units achieve reduced bandwidth due to PCIe and DMA overheads. Second, the packet buffer scheme affects the parallelization approach, i.e., the distribution of the network packets to the stream processors: the simpler the buffer format, the better the parallelization scheme. Two different approaches for collecting packets have been implemented: the first uses fixed buckets for storing the network packets; the second uses a more sophisticated, index-based scheme in which, instead of pre-allocating a different fixed-size bucket for each packet, all packets are stored back-to-back in a serial packet buffer. Both schemes are described in detail in Section 3.1.1.
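A host-side sketch of the batching idea, under stated assumptions: packets accumulate in a page-locked (pinned) staging buffer and are copied to the GPU in one cudaMemcpy when the buffer fills. The buffer size and function names are illustrative.

    #include <cuda_runtime.h>
    #include <string.h>

    #define BUF_SIZE (16 * 1024 * 1024)    /* illustrative batch size */

    static unsigned char *h_buf, *d_buf;
    static size_t used = 0;

    void batching_init(void) {
        /* pinned host memory gives the DMA engine its best transfer rate */
        cudaHostAlloc((void **)&h_buf, BUF_SIZE, cudaHostAllocDefault);
        cudaMalloc((void **)&d_buf, BUF_SIZE);
    }

    void on_packet(const unsigned char *pkt, size_t len) {
        if (used + len > BUF_SIZE) {
            /* buffer full: one large transfer instead of many small ones */
            cudaMemcpy(d_buf, h_buf, used, cudaMemcpyHostToDevice);
            /* ... launch the matching kernel on d_buf (omitted) ... */
            used = 0;
        }
        memcpy(h_buf + used, pkt, len);
        used += len;
    }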

4.2 Based on minimizing the computation cost

Most high performance systems that perform deep packet inspection implement simple string matching algorithms to match packets against a large but finite set of strings. However, there is growing interest in the use of regular expression-based pattern matching, since regular expressions offer superior expressive power and flexibility. Deterministic finite automata (DFA) representations are typically used to implement regular expressions. However, DFA representations of the regular expression sets arising in network applications require large amounts of memory, limiting their practical application. In [11], S. Kumar et al. introduced a new representation for regular expressions, called the Delayed Input DFA (D2FA), which substantially reduces space requirements compared to a DFA. A D2FA is constructed by transforming a DFA, incrementally replacing several transitions of the automaton with a single default transition. This approach dramatically reduces the number of distinct transitions between states: for a collection of regular expressions drawn from current commercial and academic systems, a D2FA representation reduces transitions by more than 95%. Given the substantially reduced space requirements, they describe an efficient architecture that can perform deep packet inspection at multi-gigabit rates. Their architecture uses multiple on-chip memories in such a way that each remains uniformly occupied and accessed over a short duration, thus effectively distributing the load and enabling high throughput.

5 Algorithms for minimizing computational cost over GPU

Traditionally, deep packet inspection has been limited to comparing packet content against sets of strings. State-of-the-art systems, however, are replacing string sets with regular expressions due to their increased expressiveness. The memory needed to represent a DFA is determined by the product of the number of states and the number of transitions from each state; for an ASCII alphabet, each state has 256 outgoing edges. In [11], a highly compact DFA representation is introduced that reduces the number of transitions associated with each state. The main observation is that groups of states in a DFA often have identical outgoing transitions, and this duplicate information can be used to reduce memory requirements. For example, suppose there are two states s1 and s2 that make transitions to the same set of states, {S}, for some set of input characters, {C}. These transitions can be eliminated from one state, say s1, by introducing a default transition from s1 to s2 that is followed for all the characters in {C}. Essentially, s1 now only maintains unique next states for those transitions not common to s1 and s2, and uses the default transition to s2 for the common transitions. A DFA augmented with such default transitions is called a Delayed Input DFA (D2FA).
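The following sketch shows how matching proceeds on such an automaton, under an assumed in-memory representation (the struct and names are illustrative): when the current state has no labeled edge for the input character, default transitions are followed, consuming no input, until one is found.

    typedef struct {
        int next[256];   /* labeled transitions; -1 where no labeled edge exists */
        int def;         /* target of the default transition */
    } State;

    /* consume one input character c starting from state s */
    int d2fa_step(const State *states, int s, unsigned char c) {
        while (states[s].next[c] < 0)
            s = states[s].def;    /* follow default edges, consuming no input */
        return states[s].next[c];
    }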

5.1 Converting DFAs to D2FAs

Although we are in general interested in any equivalent D2FA for a given DFA, there is no general procedure for synthesizing a D2FA directly. Consequently, the construction procedure transforms an ordinary DFA by introducing default transitions in a systematic way while maintaining equivalence. The procedure does not change the state set or the set of matching patterns for a given state; hence, equivalence can be maintained by ensuring that the destination state function δ does not change. Consider two states u and v, where both u and v have a transition labeled by the symbol a to a common third state w, and no default transition. If we introduce a default transition from u to v, we can eliminate the a-transition from u without affecting the destination state function.

Figure 7: Automata which recognize the expressions a+, b+c, and c+d+

A slightly more general version of this observation is stated below.

Lemma 1. Consider a D2FA with distinct states u and v, where u has a transition labeled by the symbol a and no outgoing default transition. If δ(a, u) = δ(a, v), then the D2FA obtained by introducing a default transition from u to v and removing the transition from u to δ(a, u) is equivalent to the original D2FA.

Note that by the same reasoning, if there are multiple symbols a for which u has a labeled outgoing edge and for which δ(a, u) = δ(a, v), the introduction of a default edge from u to v allows us to eliminate all these edges. The procedure for converting a DFA to a smaller D2FA applies this transformation repeatedly; the equivalence of the initial and final D2FAs follows by induction. The D2FA on the right side of Figure 7 was obtained from the DFA on the left by applying this transformation to the state pairs (2,1), (3,1), (5,1) and (4,1). For each state we can have only one default transition, so it is important to choose the default transitions carefully to obtain the largest possible reduction. The choice of default transitions is also restricted to ensure that no cycle is formed by default transitions alone. With this restriction, the default transitions define a collection of trees with the transitions directed towards the tree roots, and the set of transitions that gives the largest space reduction can be identified by solving a maximum weight spanning tree problem on an undirected graph referred to as the space reduction graph. The space reduction graph for a given DFA is a complete, undirected graph defined on the same vertex set as the DFA. The edge joining a pair of vertices (states) u and v is assigned a weight w(u, v) that is one less than the number of symbols a for which δ(a, u) = δ(a, v). Notice that the spanning tree of the space reduction graph that corresponds to the default transitions of the D2FA in Figure 7 has a total weight of 3+3+3+2 = 11, which is the difference in the number of transitions between the two automata; this is in fact a maximum weight spanning tree for this graph.
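To make the transformation and the edge weights concrete, here is a sketch in the State representation assumed earlier; it remains illustrative, and the weight is computed on the original DFA, where every next[c] is defined.

    /* apply the Lemma 1 transformation: give u a default edge to v and delete
       every labeled edge of u whose destination v duplicates */
    void set_default(State *states, int u, int v) {
        states[u].def = v;
        for (int c = 0; c < 256; c++)
            if (states[u].next[c] >= 0 &&
                states[u].next[c] == states[v].next[c])
                states[u].next[c] = -1;   /* now reached via the default edge */
    }

    /* weight of edge (u, v) in the space reduction graph: one less than the
       number of symbols on which u and v agree */
    int edge_weight(const State *states, int u, int v) {
        int common = 0;
        for (int c = 0; c < 256; c++)
            if (states[u].next[c] == states[v].next[c])
                common++;
        return common - 1;
    }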

procedure refinedmaxspantree (graph G = (V, W), modifies set edge default);
    vertex u, v;
    set edges;
    set weight-set[255];
    default := {};
    edges := W;
    for edge (u, v) in edges do
        if weight(u, v) > 0
            add (u, v) to weight-set[weight(u, v)];
        fi
    rof
    for integer i = 255 down to 1 do
        while weight-set[i] ≠ [] do
            select and remove from weight-set[i] the edge (u, v) which leads
                to the smallest growth in the diameter of the default tree;
            if vertices u and v belong to different default trees
                if default U {(u, v)} maintains the diameter bound
                    default := default U {(u, v)};
                fi
            fi
        od
    rof
end;
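The diameter bound in this procedure is what yields the deterministic performance guarantees mentioned in the conclusion: default transitions consume no input, so in the worst case a single input character triggers a walk along a chain of default edges, and bounding the diameter of each default-transition tree bounds the number of memory accesses per input byte.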


6 Conclusion

As far as the exploitation of GPUs is concerned, we can focus on utilizing multiple GPUs instead of a single one. Modern motherboards support dual GPUs, and there are PCI Express backplanes that support multiple slots. We believe that building such clusters of GPUs will be able to support multi-gigabit-per-second intrusion detection systems. More development is needed in memory space efficiency; potential solutions are required to convert the state transition table into efficient data structures. A new representation for regular expressions, called the Delayed Input DFA (D2FA), significantly reduces the space requirements of a DFA by replacing its multiple transitions with a single default transition. Since the construction of an efficient D2FA from a DFA is NP-hard, heuristics are applied for D2FA construction that provide deterministic performance guarantees. Results suggest that a D2FA constructed from a DFA can reduce memory space requirements by more than 95%; thus, the entire automaton can fit in on-chip memories. Since embedded memories provide ample bandwidth, further space reductions are possible by splitting the regular expressions into multiple groups and creating a D2FA for each of them. The distribution of pattern signatures in GPU memory, such as a per-signature-per-thread-block technique, also needs to be addressed.


References

[1] Weinsberg, Y., et al. High performance string matching algorithm for a network intrusion prevention system (NIPS). 2006: IEEE.
[2] Vasiliadis, G., M. Polychronakis, and S. Ioannidis. Parallelization and characterization of pattern matching using GPUs. IEEE. p. 216-225.
[3] Tumeo, A., O. Villa, and D. Sciuto. Efficient pattern matching on GPUs for intrusion detection systems. 2010: ACM.
[4] Vasiliadis, G., et al. Gnort: High performance network intrusion detection using graphics processors. 2008: Springer. p. 116-134.
[5] Huang, N.F., et al. A GPU-based multiple-pattern matching algorithm for network intrusion detection systems. 2008: IEEE. p. 62-67.
[6] Dharmapurikar, S. and J. Lockwood. Fast and scalable pattern matching for content filtering. 2005: IEEE.
[7] Liu, R.T., et al. A fast pattern matching algorithm for network processor-based intrusion detection system. 2004: IEEE.
[8] Taherkhani, M.A. and M. Abbaspour. An efficient hardware architecture for deep packet inspection in hybrid intrusion detection systems. 2009: IEEE.
[9] Vasiliadis, G., et al. Regular expression matching on graphics hardware for intrusion detection. 2009: Springer.
[10] NVIDIA CUDA, http://www.nvidia.com/object/cuda_home_new.html
[11] Kumar, S., et al. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. ACM SIGCOMM Computer Communication Review, 2006. 36(4): p. 339-350.

