Abstract
Network traffic classification and application identification provide important benefits for IP network engineering, management, and control, as well as other key domains. Currently popular methods, such as port-based and payload-based classification, have shown some disadvantages, and machine learning based methods are a promising alternative in which traffic is classified according to payload-independent statistical characteristics. This paper introduces the different levels of network traffic analysis and the relevant background in machine learning, and analyzes the problems of port-based and payload-based methods in traffic classification. Given the advantages of the machine learning based approach, we experiment with unsupervised K-means to evaluate its efficiency and performance. We adopt feature selection to find an optimal feature set and a log transformation to improve accuracy. The experimental results on different datasets show that the method can obtain up to 80% overall accuracy and that, after a log transformation, the accuracy improves to 90% or more.
1. Introduction
At present, Internet-based TCP/IP technology is developing in depth, with the deployment of new-generation infrastructure, the development of new technologies, and the emergence of new application patterns and demands. Compared with the rapid development of the Internet, there has been little research on network behavior. The Internet is not only volatile, heterogeneous, and dynamic, but also strongly social, and user behavior has an important effect on it. It is therefore an interesting direction to understand the statistical and dynamic properties of such a system and the behavioral characteristics of Internet users. In addition, research on Internet and user behavior is an important step in many network management tasks. Network traffic analysis is carried out to address these problems. Almost all activities related to the network are linked to traffic. Network traffic is an important carrier that records and reflects Internet and user activities; it is also an important component of network behavior. By analyzing network traffic statistics, we can understand the statistical behavior of the network indirectly. As a variety of applications emerge alongside the traditional ones (e.g., HTTP, email, web, and FTP), new applications such as P2P have gained strong momentum. Classifying traffic and identifying applications is therefore a worthwhile task. A number of areas, such as trend analysis and dynamic access control, can benefit from it. At the same time, accurate classification of Internet traffic is an important basis for network security and traffic engineering. Traffic statistics for different application types reflect user behavior on the network, so they can help network administrators control traffic such that business-critical traffic is given higher-priority service on their network.
The remainder of the paper is structured as follows. Section 2 describes related work. Section 3 introduces the K-means algorithm used in the paper. Section 4 introduces the datasets used in our work and some preprocessing. Section 5 evaluates the experimental results. Section 6 concludes and outlines future work.
2. Related work
At present, the main types of network application include HTTP, P2P, SMTP, POP3, Telnet, DNS, and FTP. This section discusses the levels of traffic analysis and identifies the level we are concerned with. It also surveys several techniques presented in the literature, such as port number mapping, payload-based analysis, and machine learning.
Current research on network traffic analysis mainly focuses on the bit level, packet level, flow level, and stream level. Bit-level research mostly concerns the quantitative characteristics of network traffic, such as the transmission rate of network links and changes in throughput. Packet-level research is concerned with the arrival process of IP packets, delay, and packet loss rate. Ref. [1] studied changes in backbone network flow load, round-trip time, packet reordering ratio, and delay. Flows are partitioned on the basis of address and protocol. For example, [2] defines a flow as a series of packet exchanges between two hosts, identified by a 5-tuple (source IP address, source port, destination IP address, destination port, application protocol). This level mainly considers local characteristics such as the flow arrival process and inter-arrival time. Ref. [3] defines a stream as a 3-tuple (source IP address, destination address, application protocol). The goal there is to focus on the long-term statistical characteristics of backbone network traffic. Across these four levels, the granularity of traffic increases from fine to coarse, and the time scale of concern increases gradually. At different time scales, network traffic exhibits different behavioral regularities. In this paper, the level of concern is the flow. The goal is to classify different flows and identify their application types.
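The flow and stream definitions above can be illustrated with a minimal Python sketch. The `Packet` record, its field names, and the sample addresses are hypothetical and not taken from the paper. The flow key is made direction-independent on the assumption that a "series of packet exchanges between two hosts" covers both directions, while the stream key follows the 3-tuple literally.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical, simplified packet record (header fields only).
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int


def flow_key(pkt):
    """5-tuple key; both directions of an exchange map to the same flow (assumption)."""
    lo, hi = sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)])
    return (lo[0], lo[1], hi[0], hi[1], pkt.protocol)


def stream_key(pkt):
    """3-tuple key: source address, destination address, protocol."""
    return (pkt.src_ip, pkt.dst_ip, pkt.protocol)


def group_by(packets, key_fn):
    """Group packets into lists that share the same key."""
    groups = defaultdict(list)
    for pkt in packets:
        groups[key_fn(pkt)].append(pkt)
    return dict(groups)


# Made-up packets: one request/response pair plus a second connection.
packets = [
    Packet("10.0.0.1", 40000, "10.0.0.2", 80, "TCP", 60),
    Packet("10.0.0.2", 80, "10.0.0.1", 40000, "TCP", 1500),
    Packet("10.0.0.1", 40001, "10.0.0.2", 80, "TCP", 60),
]
flows = group_by(packets, flow_key)
streams = group_by(packets, stream_key)
```

Because the stream key ignores ports, the two connections from the same client collapse into one stream, which matches the coarser granularity described for the stream level.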
of modern artificial intelligence. Its ability to continually gain new knowledge or skills and to reorganize its knowledge structure to improve performance has made it a widely used method in network traffic classification. The machine learning procedure can be divided into two steps: building the classification model and then classifying. Machine learning techniques can be divided into supervised and unsupervised categories. [7] used a supervised algorithm called Naive Bayes. They take hand-classified network data as input to a Bayes estimator and achieve high accuracy on both a per-byte and a per-packet basis. Ref. [8] adopts nearest neighbor and linear discriminant analysis to classify traffic into different applications using up to four attributes. The supervised approach requires the training data to be labeled before the model is built. These methods aim to improve classification accuracy; they cannot discover new applications. Among clustering techniques, [9] uses AutoClass to group traffic into different clusters and determines the best set of attributes using SFS techniques. [10] used a supervised method, Naive Bayes, and an unsupervised method, EM. The authors take the total number of packets, mean packet size (in each direction and combined), mean data packet size, flow duration, and mean inter-arrival time of packets as discriminators and evaluate the results of the two methods. In their follow-up work [11], they use two other clustering techniques, K-means and DBSCAN, to study the advantages of unsupervised techniques. The results demonstrate that an unsupervised clustering approach can identify new applications by examining the flows that are grouped to form a new cluster. Moreover, labeling all flows in advance is a time-consuming and difficult task.
Unsupervised techniques do not need hand-labeled traces; they rely only on the inner similarity among the flows in a training set to form clusters. [12] proposed a method that relies on the sizes of the first five packets of a TCP flow to distinguish different applications. It classifies flows accurately based on information contained only in the first few packets, but it requires that the first few packets arrive in order.
3. K-means algorithms
This section reviews the algorithm used in our work, namely the unsupervised K-means algorithm. This approach takes statistical information as an input vector to build the classification models (or classifiers). The K-means clustering algorithm is a simple but popular analysis method. Its basic idea is to start with a training set and a chosen number of clusters (k) to find. The items within a training set are assigned to a cluster according to a similarity measurement, such as distance. We use Euclidean distance to estimate similarity in this paper, which for two n-dimensional vectors x and y is defined as:

$$\mathrm{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

The smaller the distance between two items, the more similar they are. The K-means method tries to find an optimal solution by minimizing the square error, which is defined as:

$$E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left| \mathrm{dist}(x_j, c_i) \right|^2 \qquad (2)$$

The square error is the sum of the squared distances between each object x and the center of its cluster. Here c_i represents the center of cluster i, and the inner sum runs over the n_i items x_j assigned to that cluster. K-means clustering then proceeds by repeated application of a two-step process until the members of the clusters no longer change: 1) the mean vector (center) of all items in each cluster is computed; 2) items are reassigned to clusters based on the new centers. At last, we get the final partitioning. Because we run the method several times, we consider the result an overall optimal clustering solution.
4. Experiment setup
This section describes the traces and discriminators that form the basis of our work as well as the analysis tools and feature selection techniques used in the evaluation.
traces, only three subsets of the trace are used. Table 1 shows some basic information about the subsets used in our work. In our study, we consider only TCP-based applications; UDP data are left for future work. TCP is a stateful protocol. Generally, a complete flow is well defined when both the start of a connection, using TCP's three-way handshake, and the tear-down are observed.

Table 1: Basic information of each subset
Data-set    Subset1               Subset2               Subset3
Start-time  2003-Aug-20 01:37:37  2003-Aug-20 04:39:10  2003-Aug-20 14:55:44
End-time    2003-Aug-20 02:05:54  2003-Aug-20 05:09:05  2003-Aug-20 15:22:37
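As an illustration of the complete-flow criterion used in this work (three-way handshake at the start and tear-down both observed), the following Python sketch checks the header flags of one flow. The packet representation, the direction labels, and the choice to treat either FIN or RST as tear-down are assumptions made for the example, not details given in the paper.

```python
def handshake_end(packets):
    """Return the index just past the three-way handshake, or None if it is not seen.

    `packets` is a list of (direction, flags) pairs in capture order, where
    direction is "a-b" (client to server) or "b-a" and flags is a set of
    TCP flag names. This representation is hypothetical.
    """
    expected = [("a-b", {"SYN"}), ("b-a", {"SYN", "ACK"}), ("a-b", {"ACK"})]
    step = 0
    for i, (direction, flags) in enumerate(packets):
        want_dir, want_flags = expected[step]
        if direction == want_dir and set(flags) == want_flags:
            step += 1
            if step == len(expected):
                return i + 1
    return None


def is_complete_flow(packets):
    """True when both the handshake and a later tear-down packet are observed."""
    end = handshake_end(packets)
    if end is None:
        return False
    # Assumption: a FIN or RST after the handshake counts as tear-down.
    return any({"FIN", "RST"} & set(flags) for _, flags in packets[end:])


# Made-up flows for illustration.
complete = [
    ("a-b", {"SYN"}),
    ("b-a", {"SYN", "ACK"}),
    ("a-b", {"ACK"}),
    ("a-b", {"ACK", "PSH"}),
    ("b-a", {"FIN", "ACK"}),
]
midstream = [("a-b", {"ACK", "PSH"}), ("b-a", {"FIN", "ACK"})]
no_teardown = complete[:4]
```

A filter of this kind would discard flows whose start or end falls outside the capture window, such as `midstream` and `no_teardown` here.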
5. Evaluation
This section evaluates the effectiveness of the ML method on different datasets and compares the accuracy before and after the transformation.
Feature selection algorithms are broadly divided into the filter and wrapper models [15]. The wrapper method evaluates the effect of different subsets using specific classification algorithms, not clustering methods, so the selected discriminators are biased towards the algorithm used. Because different feature evaluation and search techniques may have different preferences, we combined several of them: CFS, consistency-based subset evaluation, and information gain attribute evaluation for feature selection; and backward and forward greedy search, Best First, and Ranker for searching. We believe that the more frequently a discriminator appears in the selected feature subsets, the better it is at discriminating the classes. After several experiments with different datasets, we arrived at the following subset of features used in our work: the number of total-packets-b-a, the number of actual-data-bytes-b-a, the number of the pushed-data-pkts-a-b, the number of the pushed-data-pkts-b-a, size of the mean-IPpacket-a-b, size of the max-IPpacket-a-b, variance of the IPpacketsize-a-b, size of the mean-IPpacket-b-a, size of the max-IPpacket-b-a, variance of the IPpacketsize-b-a, and duration, where a is the client and b is the server.
In order to get higher classification accuracy, a transformation such as Log or Gaussian is usually adopted. Through the transformation step, we can reduce the influence of noise or abnormal data. In our work, we use the Log transformation to transform the flow discriminators, because the training set follows a normal distribution after the Log transformation. Normally distributed data generally allow better analysis and forecasting results.
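To make the log transformation and the K-means procedure of Section 3 concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The feature values are made up, `log1p` (log(1 + x)) is an assumption that keeps zero-valued discriminators finite, and keeping the lowest-error run is one plausible reading of "run the method several times".

```python
import math
import random


def log_transform(flows):
    """Apply log(1 + x) to every discriminator (log1p is an assumption)."""
    return [[math.log1p(v) for v in flow] for flow in flows]


def dist(x, y):
    """Euclidean distance between two vectors, as in Eq. (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def assign(flows, centers):
    """Index of the nearest center for every flow."""
    return [min(range(len(centers)), key=lambda i: dist(x, centers[i])) for x in flows]


def mean_vector(items):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(items) for col in zip(*items)]


def square_error(flows, labels, centers):
    """Sum of squared distances of each flow to its cluster center, as in Eq. (2)."""
    return sum(dist(x, centers[c]) ** 2 for x, c in zip(flows, labels))


def kmeans(flows, k, rng, max_iter=100):
    """Repeat the two-step process until cluster membership stops changing."""
    centers = rng.sample(flows, k)  # random initial centers
    labels = assign(flows, centers)
    for _ in range(max_iter):
        # Step 1: recompute each center (an empty cluster keeps its old center).
        new_centers = []
        for i in range(k):
            members = [x for x, c in zip(flows, labels) if c == i]
            new_centers.append(mean_vector(members) if members else centers[i])
        centers = new_centers
        # Step 2: reassign flows to the nearest new center.
        new_labels = assign(flows, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers, square_error(flows, labels, centers)


def best_of(flows, k, runs=5, seed=0):
    """Run K-means several times and keep the lowest square error (assumption)."""
    rng = random.Random(seed)
    return min((kmeans(flows, k, rng) for _ in range(runs)), key=lambda r: r[2])


# Made-up discriminators per flow: [packet count, byte count].
raw = [[2, 120], [3, 150], [2, 100],
       [900, 1200000], [1100, 1500000], [1000, 1300000]]
labels, centers, err = best_of(log_transform(raw), k=2)
```

On the raw values the byte counts would dominate the Euclidean distance; after the transformation both discriminators contribute on a comparable scale, which is one intuition for why the transformation can help a distance-based method.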
[Figure: Overall accuracy (%) versus Number of Clusters (K), for original data and log data]
[Figure: Recall (%), log data]
Through the analysis of the results, we find that after the log transformation the recall of each class increases considerably: 7 of 11 classes are above 80%, and six of them exceed 90%. For non-log data, however, only 2 of 11 classes have recall above 90%, and most are at or below 70%. For SERVICES and INTERACTIVE, K-means performs far better on log data than on non-log data, with recall of 81.8182% and 90.566% compared with 4.5455% and 0%, respectively. We note that for GAMES flows, the recall on both log and non-log data is 0%; this is because we have only one GAMES flow, and the method cannot learn enough about this type of traffic. In conclusion, we believe the log transformation is a very useful step in classifying traffic: we obtain higher overall accuracy and recall after applying it, and the transformation itself is simple and takes little time.
5.3 Discussion
Through the above experimental analysis, we observed that the method performed well at classifying the flows, especially on log-transformed data. The approach takes as its input vector only statistical information that can be obtained from the packet headers; therefore, the use of payload is avoided. The runtime required to build the models is a notable concern. All of our experiments were performed on one node of a cluster with an Intel Xeon(TM) 3.0GHz processor and 2GB of RAM. The runtime of K-means on non-log data is about 2 hours or more in order to obtain a higher accuracy. The runtime on log data with the same parameters is a little lower, and the overall accuracy improves by up to 10%. We believe the log transformation is a very useful step in classifying traffic. Compared with some supervised methods used in traffic classification, unsupervised K-means has some advantages: it does not require the training data to be labeled in advance, so new applications or previously unseen flows can be identified when they form an independent cluster. And because each cluster produced by K-means corresponds nearly to one application, selectively identifying several flows within a cluster in order to map the cluster to a specific application yields a significant time saving.
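The idea of identifying a few flows per cluster and mapping the whole cluster to an application can be sketched as follows. The majority-vote rule, the `UNKNOWN` label for unmapped clusters, and the sample data are assumptions for illustration. Overall accuracy and recall use their usual definitions, which the paper does not spell out.

```python
from collections import Counter, defaultdict


def map_clusters(cluster_ids, sampled_labels):
    """Map each cluster to the most common application among its sampled flows.

    `sampled_labels` maps a flow index to the application identified by hand
    for that flow; only a few flows per cluster need to appear in it.
    """
    votes = defaultdict(Counter)
    for idx, app in sampled_labels.items():
        votes[cluster_ids[idx]][app] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def predict(cluster_ids, cluster_to_app, unknown="UNKNOWN"):
    """Label every flow with the application of its cluster."""
    return [cluster_to_app.get(c, unknown) for c in cluster_ids]


def overall_accuracy(predicted, truth):
    """Fraction of all flows whose predicted application is correct."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def recall_per_class(predicted, truth):
    """For each application, the fraction of its flows that were labeled correctly."""
    total = Counter(truth)
    hit = Counter(t for p, t in zip(predicted, truth) if p == t)
    return {app: hit[app] / total[app] for app in total}


# Made-up example: six flows in three clusters, two flows identified by hand.
cluster_ids = [0, 0, 0, 1, 1, 2]
truth = ["SERVICES", "SERVICES", "INTERACTIVE",
         "INTERACTIVE", "INTERACTIVE", "GAMES"]
mapping = map_clusters(cluster_ids, {0: "SERVICES", 3: "INTERACTIVE"})
predicted = predict(cluster_ids, mapping)
accuracy = overall_accuracy(predicted, truth)
recall = recall_per_class(predicted, truth)
```

In this toy data the single GAMES flow sits in a cluster with no hand-identified sample, so its recall is zero, and a cluster that stays unmapped is also where a new application would show up.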
7. Acknowledgement
This work was supported by the National Education Council of China Beijing (The research of key technologies of Grid construction: JD100060630). We are grateful for the good experimental environment and the useful comments provided.
8. References
[1] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, et al., Packet-level traffic measurements from the Sprint IP backbone, IEEE Trans. on Networks, 2003, 17(6): 6-16.
[2] C. Barakat, P. Thiran, G. Iannaccone, C. Diot, P. Owezarski, Modeling Internet backbone traffic at the flow level, IEEE Trans. on Signal Processing, Special Issue on Networking, 2003, 51(8): 2111-2124.
[3] T. He, H. Zhang, Z. Li, A methodology for analyzing backbone network traffic at stream-level, Communication Technology Proceedings, 1, 2003, pp. 98-102.
[4] IANA, Internet Assigned Numbers Authority, http://www.iana.org/assignments/port-numbers.
[5] C. Dewes, A. Wichmann, A. Feldmann, An analysis of Internet chat systems, IMC'03, New York: ACM Press, 2003, pp. 51-64.
[6] P. Haffner, S. Sen, O. Spatscheck, D. Wang, ACAS: Automated Construction of Application Signatures, SIGCOMM'05 MineNet Workshop, New York: ACM Press, 2005, pp. 197-202.
[7] A. Moore, D. Zuev, Internet Traffic Classification Using Bayesian Analysis Techniques, SIGMETRICS'05, New York: ACM Press, 2005, pp. 50-60.
[8] M. Roughan, S. Sen, O. Spatscheck, N. Duffield, Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification, IMC'04, Taormina, Italy, 2004, pp. 5-27.
[9] S. Zander, H. Nguyen, G. Armitage, Automated Traffic Classification and Application Identification Using Machine Learning, Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary, Washington: IEEE Computer Society, 2005, pp. 250-257.
[10] J. Erman, M. Arlitt, A. Mahanti, Internet Traffic Identification using Machine Learning Techniques, Proc. of 49th IEEE Global Telecommunications Conference, 2006.
[11] J. Erman, M. Arlitt, A. Mahanti, Traffic Classification Using Clustering Algorithms, Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, New York: ACM Press, 2006, pp. 11-15.
[12] L. Bernaille, R. Teixeira, I. Akodjenou, Traffic Classification on the Fly, ACM SIGCOMM Computer Communication Review, New York: ACM Press, 2006, pp. 23-26.
[13] A.W. Moore, J. Hall, C. Kreibich, E. Harris, I. Pratt, Architecture of a Network Monitor, Passive & Active Measurement Workshop, 2003.
[14] A.W. Moore, D. Zuev, Discriminators for use in flow-based classification, Cambridge, Intel Research, 2005.
[15] M.A. Hall, G. Holmes, Benchmarking attribute selection techniques for discrete class data mining, IEEE Transactions on Knowledge & Data Engineering, 2003.
[16] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, San Francisco: Morgan Kaufmann, 2005.