
Second International Multisymposium on Computer and Computational Sciences

Network Traffic Classification Using K-means Clustering


Liu Yingqiu, Li Wei, Li Yunchun Network Technology Key Lab of Beijing, School of Computer Science and Engineering, Beihang University, Beijing 100083, China yingqiu.cn@gmail.com

Abstract
Network traffic classification and application identification provide important benefits for IP network engineering, management, and control, among other key domains. Current popular methods, such as port-based and payload-based classification, have shown clear disadvantages, and machine learning based methods are a promising alternative: traffic is classified according to payload-independent statistical characteristics. This paper introduces the different levels of network traffic analysis and the relevant background in machine learning, and analyzes the problems of port-based and payload-based traffic classification. Given the advantages of the machine learning based approach, we experiment with unsupervised K-means clustering to evaluate its efficiency and performance. We adopt feature selection to find an optimal feature set and a log transformation to improve accuracy. Experimental results on different datasets show that the method can obtain up to 80% overall accuracy and that, after a log transformation, the accuracy improves to 90% or more.

1. Introduction
At present, TCP/IP and Internet technology are developing in depth: new-generation infrastructure is being deployed, new technologies are emerging, and new application patterns and demands are appearing. Compared with this rapid development, there has been relatively little research on network behavior. The Internet is not only volatile, heterogeneous, and dynamic, but also strongly social: user behavior has an important effect on it. It is therefore an interesting direction to understand the statistical and dynamic properties of such a system and the behavioral characteristics of Internet users. In addition, research on Internet and user behavior is an important step in many network management tasks. Network traffic analysis is performed in order to address these problems.

Almost all activities related to the network are reflected in its traffic. Network traffic is an important carrier that records Internet and user activities; it is also an important component of network behavior. Through the analysis of traffic statistics, we can indirectly understand the statistical behavior of the network. Besides traditional applications (e.g., HTTP, email, and FTP), new applications such as P2P have gained strong momentum. Classifying traffic and identifying applications is therefore interesting work, and a number of areas, such as trend analysis and dynamic access control, can benefit from it. At the same time, accurate classification of Internet traffic is an important basis for network security and traffic engineering. Traffic statistics broken down by application type reflect user behavior on the network, so they can help network administrators control traffic such that business-critical traffic receives higher-priority service. The remainder of the paper is structured as follows. Section 2 describes related work. Section 3 introduces the K-means algorithm used in the paper. Section 4 introduces the datasets used in our work and the preprocessing applied to them. Section 5 evaluates the experimental results. Section 6 concludes and outlines future work.

2. Related work
At present, the main types of network application include HTTP, P2P, SMTP, POP3, Telnet, DNS, FTP, etc. This section discusses the levels of traffic analysis and indicates which level we are concerned with. Meanwhile, several techniques presented in the literature are surveyed, such as port number mapping, payload-based analysis, and machine learning. Current research on network traffic analysis mainly focuses on the bit level, packet level, flow level, and stream level. Bit-level research mostly concerns the quantitative characteristics of network traffic, such as link transmission rates and changes in throughput.



Packet-level analysis concerns the arrival process of IP packets, delay, and packet loss rate. Ref. [1] studied changes in the backbone network in terms of flow load, round-trip time, packet reordering ratio, and delay. Flows are partitioned on the basis of addresses and protocol; for example, Ref. [2] defines a flow as a series of packet exchanges between two hosts, identified by a 5-tuple (source IP address, source port, destination IP address, destination port, application protocol). This level mainly considers local characteristics such as the flow arrival process and inter-arrival times. Ref. [3] defines a stream by a 3-tuple (source IP address, destination IP address, application protocol), with the goal of focusing on the long-term statistical characteristics of backbone traffic. Across these four levels the granularity of traffic increases from small to large, and the time scale of interest increases accordingly; at different time scales, network traffic exhibits different regular behavior. In this paper, the level of concern is the flow: the goal is to classify different flows and to identify their application types.
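As a concrete illustration of this flow definition, the minimal Python sketch below groups packet records into flows keyed by the 5-tuple. The packet fields and the packets list are hypothetical stand-ins and are not part of the original paper.

```python
from collections import defaultdict

def group_into_flows(packets):
    """Group packet records into flows keyed by the 5-tuple
    (src IP, src port, dst IP, dst port, protocol)."""
    flows = defaultdict(list)
    for pkt in packets:  # each pkt is assumed to be a dict of header fields
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return flows

# Example usage with two hypothetical packets belonging to one HTTP flow:
packets = [
    {"src_ip": "10.0.0.1", "src_port": 3345, "dst_ip": "192.0.2.7",
     "dst_port": 80, "proto": "TCP", "size": 60},
    {"src_ip": "10.0.0.1", "src_port": 3345, "dst_ip": "192.0.2.7",
     "dst_port": 80, "proto": "TCP", "size": 1500},
]
print(len(group_into_flows(packets)))  # -> 1 flow
```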

2.1. Port Number Analysis


The traditional method relies on associating a well-known port number with a specific application in order to identify different kinds of Internet traffic. The port-based method has been successful because many well-known applications use specific port numbers assigned by IANA [4]; for example, HTTP traffic uses port 80 and FTP control uses port 21. But with the emergence of P2P applications, the accuracy of the port-based method has declined sharply, because such applications try to hide from firewalls and network security tools by using dynamic port numbers or by masquerading as HTTP or FTP traffic. The port-based method is therefore no longer reliable.
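For illustration, port-based identification amounts to a simple lookup against the IANA assignments. The sketch below uses a hypothetical, heavily abbreviated mapping, not the full IANA table.

```python
# Hypothetical, abbreviated port-to-application table (IANA assigns many more).
WELL_KNOWN_PORTS = {80: "HTTP", 21: "FTP-CONTROL", 25: "SMTP", 110: "POP3", 53: "DNS"}

def classify_by_port(src_port, dst_port):
    """Return the application guessed from either port, or 'UNKNOWN'.
    Fails for applications running on dynamic or masqueraded ports."""
    return WELL_KNOWN_PORTS.get(dst_port, WELL_KNOWN_PORTS.get(src_port, "UNKNOWN"))

print(classify_by_port(3345, 80))       # -> HTTP
print(classify_by_port(51413, 52000))   # -> UNKNOWN (e.g., P2P on dynamic ports)
```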

2.2. Payload-based Analysis


To deal with the disadvantages of the above method, a more reliable technique is to inspect the packet payload [5, 6]. In these methods, payloads are analyzed to determine whether they contain characteristic signatures of known applications. This technique can be extremely accurate when the payload is not encrypted, but that assumption is often unrealistic, because some P2P applications use encryption, variable-length padding, and similar measures to avoid detection. In addition, the method demands high processing and storage capacity, and examining user payloads raises privacy concerns.
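A minimal sketch of such signature matching is shown below; the signatures are hypothetical simplifications and not the actual patterns used in [5, 6].

```python
import re

# Hypothetical, simplified application signatures over the first payload bytes.
SIGNATURES = {
    "HTTP": re.compile(rb"^(GET|POST|HTTP/1\.[01]) "),
    "SMTP": re.compile(rb"^(HELO|EHLO) "),
}

def classify_by_payload(payload: bytes) -> str:
    """Match the start of an unencrypted payload against known signatures."""
    for app, pattern in SIGNATURES.items():
        if pattern.match(payload):
            return app
    return "UNKNOWN"  # encrypted or padded payloads typically end up here

print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))  # -> HTTP
```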

2.3. Machine learning-based approaches

Machine learning is an important research direction of modern artificial intelligence. Its ability to continually gain new knowledge or skills and to reorganize its knowledge structure to improve performance has made it a widely used approach in network traffic classification. The machine learning procedure can be divided into two steps: building a classification model, and then classifying. Machine learning techniques fall into the categories of supervised and unsupervised. Ref. [7] used a supervised algorithm, Naive Bayes: hand-classified network data is taken as input to the Bayes estimator, achieving high accuracy on both a per-byte and a per-packet basis. Ref. [8] adopts nearest-neighbor and linear discriminant analysis to classify traffic into different applications using up to four attributes. The supervised approach requires the training data to be labeled before the model is built; the goal of these methods is to improve classification accuracy, and they cannot discover new applications. Other work uses clustering techniques: Ref. [9] uses AutoClass to group traffic into different clusters and analyzes the best set of attributes using sequential forward selection (SFS). Ref. [10] used the supervised Naive Bayes method and the unsupervised EM method, taking the total number of packets, mean packet size (in each direction and combined), mean data packet size, flow duration, and mean packet inter-arrival time as discriminators, and compared the results of the two methods. In follow-up work [11], two other clustering techniques, K-means and DBSCAN, are used to study the advantages of unsupervised techniques. The results demonstrate that an unsupervised clustering approach can identify new applications by examining the flows that are grouped into a new cluster. Labeling all flows in advance is a time-consuming and difficult task; unsupervised techniques do not need hand-labeled traces, relying only on the similarity among the flows within a training set to form clusters. Ref. [12] proposed a method that relies on the sizes of the first five packets of a TCP flow to distinguish different applications. It classifies flows accurately using information contained only in the first few packets, but it requires that those packets arrive in order.

3. K-means algorithms
This section reviews the algorithm used in our work, namely K-means (unsupervised). This approach takes statistical information as an input vector to build the classification models (or classifiers). The K-means clustering algorithm is a simple but popular analysis method. Its basic idea is to start with a training set and an assigned number of clusters k to be found.

The items within the training set are assigned to clusters according to a similarity measure, for example distance. We use Euclidean distance to estimate similarity in this paper, which is defined as

dist(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

The smaller the distance between two items, the more similar they are. The K-means method tries to find an optimal solution by minimizing the square error, which is defined as

E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} | dist(x_j, c_i) |^2    (2)

The square error is computed from the squared distance between each object x and the center of its cluster, where c_i represents the center of cluster i and n_i is the number of items in that cluster. K-means clustering then proceeds by repeated application of a two-step process until cluster membership no longer changes: 1) the mean vector (center) of all items in each cluster is computed; 2) items are reassigned to clusters based on the new centers. At the end we obtain the final partitioning. We treat it as an overall near-optimal clustering solution because we run the method several times and keep the best result.

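To make the two-step procedure concrete, the following is a minimal NumPy sketch of K-means using the Euclidean distance of Eq. (1) and the square error of Eq. (2). It is an illustrative re-implementation, not the WEKA code actually used in our experiments.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: X is an (n_flows, n_features) array, k the cluster count."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Step 2: assign each item to the nearest center (Euclidean distance, Eq. 1)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 1: recompute the mean vector (center) of each cluster
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):  # membership no longer changes
            break
        centers = new_centers
    # Square error E (Eq. 2): sum of squared distances to assigned centers
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

# Toy usage with random "flow feature" vectors
X = np.random.default_rng(1).random((200, 5))
labels, centers, sse = kmeans(X, k=4)
print(labels.shape, centers.shape, round(float(sse), 3))
```

In practice the run would be repeated with different seeds and the result with the smallest square error kept, as described above.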
4. Experiment setup

This section describes the traces and discriminators that form the basis of our work, as well as the analysis tools and feature selection techniques used in the evaluation.

4.1. Data traces

We use data collected by the high-performance network monitor described in [13]. The site is a research facility hosting about 1,000 users connected to the Internet via a full-duplex Gigabit Ethernet link. The data used in our experiments are based upon a 24-hour, full-duplex trace of this research facility. Due to the large size of the traces, only three subsets of the trace are used. Table 1 gives some basic information about the subset traces used in our work. In our study we consider only TCP-based applications; UDP data are left for future work. TCP is a stateful protocol; generally, a complete flow is well defined when both the start of a connection, using TCP's three-way handshake, and its tear-down are observed.

Table 1: Basic information of each subset

Subset  | Start time           | End time             | Duration (s) | Flows
Subset1 | 2003-Aug-20 01:37:37 | 2003-Aug-20 02:05:54 | 1696.7       | 23801
Subset2 | 2003-Aug-20 04:39:10 | 2003-Aug-20 05:09:05 | 1794.9       | 21648
Subset3 | 2003-Aug-20 14:55:44 | 2003-Aug-20 15:22:37 | 1613.4       | 65036

4.2. Classification of the data traces

The raw traffic had been hand-classified by Moore et al. in [14], so we can use the true label of each flow to evaluate our method's accuracy. The classes included in these traces are: WWW, MAIL, P2P, FTP-CONTROL, FTP-PASV, ATTACK, DATABASE, FTP-DATA, SERVICES, INTERACTIVE, MULTIMEDIA, and GAMES. Counting the number of flows belonging to each class showed that up to 83.7% of the flows in the trace were WWW traffic, so it is important to balance the classes in order to judge fairly the ability of the machine learning technique to classify all types of traffic. We therefore removed some flows and set the number of flows in each subset to between 8,000 and 20,000. Table 2 shows the summary statistics of one of the experimental subsets.

Table 2: Application distribution of one subset

Application  | # of flows | % of total
WWW          | 4058       | 50.725%
FTP-DATA     | 867        | 10.8375%
P2P          | 750        | 9.375%
MAIL         | 741        | 9.2625%
ATTACK       | 446        | 5.575%
FTP-PASV     | 349        | 4.3625%
DATABASE     | 292        | 3.65%
FTP-CONTROL  | 260        | 3.25%
SERVICES     | 213        | 2.6625%
INTERACTIVE  | 23         | 0.2875%
GAMES        | 1          | 0.0125%
TOTAL        | 8000       | 100%

4.3. Flow description

In our experiments, a flow is described by a series of features or discriminators; this information serves as the input vector for learning and classification. Assume F = {F1, F2, ..., Fn} is a set of flows. A flow item is described by a vector of discriminators, Fi = {Fij | 1 <= j <= m}, where m is the number of discriminators and Fij is the value of the jth discriminator of the ith flow. A discriminator can be the number of packets, the flow duration, etc.; this corresponds to the feature vector referred to in the machine learning literature. Meanwhile, define L = {L1, L2, ..., Lp} as the set of application types, where p is the number of classes; the values can be WWW, MAIL, etc. The goal is to train a classifier and evaluate its accuracy.

4.4. Feature selection and transformation

Generally, the set of discriminators describing a flow can number several hundred, but not all discriminators contribute positively to discriminating between traffic types, and the size and quality of the feature set greatly influence the effectiveness of ML algorithms. It is therefore important to carefully select the number and type of discriminators used to train classifiers. Feature selection algorithms are broadly organized into the filter and wrapper models [15]. The wrapper method evaluates the effect of different subsets using a specific classification algorithm rather than a clustering method, so the selected discriminators are biased towards the algorithm used. Because different feature evaluation and search techniques may have different preferences, we combined several of them: CFS, consistency-based subset evaluation, and information gain attribute evaluation for feature evaluation; backward and forward greedy search, best-first search, and ranking for searching. We assume that the more frequently a discriminator appears in the selected feature subsets, the better it is at discriminating the classes. After several experiments with different datasets, we found the feature subset used in our work: the number of total packets b->a, the number of actual data bytes b->a, the number of pushed data packets a->b, the number of pushed data packets b->a, the mean IP packet size a->b, the maximum IP packet size a->b, the variance of the IP packet size a->b, the mean IP packet size b->a, the maximum IP packet size b->a, the variance of the IP packet size b->a, and the flow duration, where a is the client and b is the server. To obtain higher classification accuracy, a transformation such as a log or Gaussian transformation is often applied; through this step, noisy or abnormal data can be handled. In our work we use the log transformation on the flow discriminators, because the training set is approximately normally distributed after the transformation, and data that follow a normal distribution generally allow better analysis and prediction.
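As an illustration of building these discriminator vectors and applying the log transformation, the sketch below assumes per-flow counters like those listed above have already been extracted; the field names are hypothetical stand-ins for the actual discriminators.

```python
import numpy as np

# Hypothetical per-flow discriminators (a subset of those listed in Section 4.4).
flows = [
    {"total_pkts_ba": 12, "data_bytes_ba": 5400, "mean_ip_size_ab": 120.0, "duration": 3.2},
    {"total_pkts_ba": 900, "data_bytes_ba": 1.2e6, "mean_ip_size_ab": 1480.0, "duration": 45.0},
]
feature_names = ["total_pkts_ba", "data_bytes_ba", "mean_ip_size_ab", "duration"]

# Assemble the (n_flows x m) feature matrix F described in Section 4.3.
X = np.array([[f[name] for name in feature_names] for f in flows], dtype=float)

# Log transformation: log1p avoids log(0) for zero-valued counters and
# compresses heavy-tailed discriminators toward a more normal-looking shape.
X_log = np.log1p(X)
print(X_log.round(2))
```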


4.5. Analysis tools

In our work, experiments were conducted using version 3.5.3 of the WEKA (Waikato Environment for Knowledge Analysis) software suite [16]. It is widely used in the machine learning community and is implemented in Java.

5. Evaluation

This section evaluates the effectiveness of the method on different datasets and compares the accuracy before and after the log transformation.

5.1. Evaluation metrics

We use the common method of K-fold cross-validation for testing the accuracy of C4.5, and run K-means several times to obtain a better solution. We then compute two standard performance metrics: 1) Recall (or true positive rate) is the number of class members classified correctly divided by the total number of members of that class. 2) Overall accuracy is the percentage of correctly classified items over the total number of items.

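Given true application labels and predicted labels, these two metrics can be computed as in the brief sketch below (an illustrative helper, not code from the paper).

```python
import numpy as np

def recall_per_class(y_true, y_pred):
    """Recall for each class: correctly classified members / total members of that class."""
    classes = np.unique(y_true)
    return {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}

def overall_accuracy(y_true, y_pred):
    """Fraction of all items classified correctly."""
    return float(np.mean(y_true == y_pred))

y_true = np.array(["WWW", "WWW", "MAIL", "P2P", "MAIL"])
y_pred = np.array(["WWW", "MAIL", "MAIL", "P2P", "MAIL"])
print(recall_per_class(y_true, y_pred))  # e.g. WWW: 0.5, MAIL: 1.0, P2P: 1.0
print(overall_accuracy(y_true, y_pred))  # 0.8
```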
5.2. Comparison of K-means with log and original data

The K-means approach begins with an assigned value k and partitions the dataset into k disjoint clusters. In our experiments we expect at least one cluster for each application type and, due to the diversity of the traffic in some classes, even more clusters for some types. To obtain a suitable k, we experiment with k from 20 to 200. Because the initial cluster assignment is random, different runs of the K-means algorithm may not give the same final clustering solution; therefore, to obtain an overall near-optimal solution, K-means is run many times and the solution with the smallest square error is kept. The overall accuracy as a function of k is shown in Figure 1. It shows that the overall accuracy gradually improves as the number of clusters increases, but when k is greater than 150 there is only a small further improvement. It is also obvious that the overall accuracy improves by at least 10 percentage points after applying the log transformation; when k is 80, the overall accuracy already reaches 90%. This result is significant, because a larger k takes more time to build the model and increases the probability of over-fitting. The per-class recall on log-transformed and non-log data is compared in Figure 2. Both experiments use the same dataset, and the figure displays each configuration's maximum performance.
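The model selection just described can be sketched as follows using scikit-learn's KMeans (an assumed substitute for the WEKA implementation actually used); X_log stands for the log-transformed feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_kmeans(X, k_values=(20, 50, 80, 100, 130, 150, 180, 200), runs=10):
    """For each k, run K-means several times with random starts and keep the
    solution with the smallest square error (inertia), as in Section 5.2."""
    models = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=runs, random_state=0).fit(X)
        models[k] = km  # km.inertia_ is the square error E of the best of `runs` starts
    return models

# X_log would be the log-transformed discriminator matrix from Section 4.4.
X_log = np.log1p(np.random.default_rng(0).random((1000, 11)) * 100)
models = best_kmeans(X_log, k_values=(20, 50), runs=5)
print({k: round(m.inertia_, 1) for k, m in models.items()})
```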


[Figure 1 plots overall accuracy (%) against the number of clusters K, from 20 to 200, for the original and log-transformed data.]

Figure 1. Overall accuracy for log and original data

[Figure 2 plots per-class recall (%) for the log-transformed and non-log data across the application types.]

Figure 2. Comparison of recall

Through the analysis of the results, we find that after the log transformation the recall of each class increases considerably: 7 of the 11 classes are above 80%, and six of them exceed 90%. By contrast, on the non-log data only 2 of the 11 classes are above 90%, and most are at 70% or lower. For SERVICES and INTERACTIVE, K-means performs far better on log data than on non-log data, with recalls of 81.8182% and 90.566% compared with 4.5455% and 0%, respectively. We note that for GAMES flows the recall is 0% on both log and non-log data; this is because we have only one GAMES flow, and the method cannot learn more about this traffic type. In conclusion, we believe the log transformation is a very useful step in classifying traffic: it yields higher overall accuracy and recall, and the transformation itself is simple and inexpensive.

5.3. Discussion

From the above experimental analysis, we observed that the method performs well at classifying flows, especially on log-transformed data. The approach uses only statistical information that can be obtained from packet headers as the input vector, so the use of payload is avoided. The runtime required to build the models is a notable issue. All of our experiments were performed on a cluster node with an Intel Xeon 3.0 GHz processor and 2 GB of RAM. The runtime of K-means on non-log data is about 2 hours or more in order to obtain higher accuracy; the runtime on log data with the same parameters is somewhat lower, while the overall accuracy improves by up to 10 percentage points. We believe the log transformation is a very useful step in traffic classification. Compared with the supervised methods used in traffic classification, unsupervised K-means has the advantage that it does not require the training data to be labeled in advance, so new applications or previously unseen flows can be identified when they form an independent cluster. And because the clusters produced by K-means nearly correspond to single applications, selectively inspecting a few flows within a cluster in order to map that cluster to a specific application yields significant time savings.

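As a sketch of the cluster-to-application mapping discussed above, the helper below labels each cluster with the majority application among a few inspected flows; it assumes cluster assignments and partial true labels are available and is not code from the paper.

```python
import numpy as np

def map_clusters_to_apps(cluster_ids, known_labels):
    """Assign each cluster the most frequent application among its inspected flows;
    flows whose label is unknown are marked with an empty string."""
    mapping = {}
    for c in np.unique(cluster_ids):
        labels_in_c = known_labels[(cluster_ids == c) & (known_labels != "")]
        if len(labels_in_c):
            apps, counts = np.unique(labels_in_c, return_counts=True)
            mapping[c] = apps[np.argmax(counts)]
        else:
            mapping[c] = "UNKNOWN"  # no inspected flows: possibly a new application
    return mapping

cluster_ids = np.array([0, 0, 0, 1, 1, 2])
known_labels = np.array(["WWW", "WWW", "MAIL", "P2P", "P2P", ""])
print(map_clusters_to_apps(cluster_ids, known_labels))  # cluster 0 -> WWW, 1 -> P2P, 2 -> UNKNOWN
```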
6. Conclusions and future work

Network traffic is an important characteristic of network applications, and there is a large body of literature on it. In this paper, from the micro perspective of network traffic, we described the use of supervised and unsupervised machine learning to classify network traffic by application. We compared the overall accuracy before and after a log transformation of the data, and showed that accuracy improves by at least 10 percentage points after the transformation. We showed that K-means performs well at traffic classification, with an accuracy of about 90%. This is a promising result because unsupervised K-means does not require a hand-classified training set, so new applications can be identified when they form a separate cluster. Considering the dynamic nature of networks, we believe unsupervised methods are promising. Our immediate next step is to improve the K-means method by starting from assigned initial objects rather than random starts, which can sharply reduce training time. We then want to apply K-means to real-time classification: we hope to tag live traffic once its application type has been identified and to apply appropriate control to certain types. Another interesting question is how stable the classifications are over time and how frequently the classifiers need to be updated.

7. Acknowledgement
This work was supported by the National Education Council of China, Beijing (The research of key technologies of Grid construction: JD100060630). We thank them for providing a good experimental environment and useful comments.

8. References
[1] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, et al., "Packet-level traffic measurements from the Sprint IP backbone," IEEE Network, 2003, 17(6): 6-16.
[2] C. Barakat, P. Thiran, G. Iannaccone, C. Diot, P. Owezarski, "Modeling Internet backbone traffic at the flow level," IEEE Trans. on Signal Processing, Special Issue on Networking, 2003, 51(8): 2111-2124.
[3] T. He, H. Zhang, Z. Li, "A methodology for analyzing backbone network traffic at stream-level," Communication Technology Proceedings, vol. 1, 2003, pp. 98-102.
[4] IANA, Internet Assigned Numbers Authority, http://www.iana.org/assignments/port-numbers.
[5] C. Dewes, A. Wichmann, A. Feldmann, "An analysis of Internet chat systems," IMC'03, New York: ACM Press, 2003, pp. 51-64.
[6] P. Haffner, S. Sen, O. Spatscheck, D. Wang, "ACAS: Automated Construction of Application Signatures," SIGCOMM'05 MineNet Workshop, New York: ACM Press, 2005, pp. 197-202.
[7] A. Moore, D. Zuev, "Internet Traffic Classification Using Bayesian Analysis Techniques," SIGMETRICS'05, New York: ACM Press, 2005, pp. 50-60.
[8] M. Roughan, S. Sen, O. Spatscheck, N. Duffield, "Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification," IMC'04, Taormina, Italy, 2004, pp. 5-27.
[9] S. Zander, T. Nguyen, G. Armitage, "Automated Traffic Classification and Application Identification Using Machine Learning," Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary, Washington: IEEE Computer Society, 2005, pp. 250-257.
[10] J. Erman, M. Arlitt, A. Mahanti, "Internet Traffic Identification using Machine Learning Techniques," Proc. of the 49th IEEE Global Telecommunications Conference, 2006.
[11] J. Erman, M. Arlitt, A. Mahanti, "Traffic Classification Using Clustering Algorithms," Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, New York: ACM Press, 2006, pp. 11-15.
[12] L. Bernaille, R. Teixeira, I. Akodjenou, "Traffic Classification on the Fly," ACM SIGCOMM Computer Communication Review, New York: ACM Press, 2006, pp. 23-26.
[13] A.W. Moore, J. Hall, C. Kreibich, E. Harris, I. Pratt, "Architecture of a Network Monitor," Passive & Active Measurement Workshop, 2003.
[14] A.W. Moore, D. Zuev, "Discriminators for use in flow-based classification," Intel Research, Cambridge, 2005.
[15] M.A. Hall, G. Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Transactions on Knowledge and Data Engineering, 2003.
[16] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, San Francisco: Morgan Kaufmann, 2005.

