This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2017.2735996, IEEE Transactions on Big Data

IEEE TRANSACTIONS ON BIG DATA, VOL. 12, NO. 6, Aug 2017

2332-7790 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—Network traffic classification plays a significant role in cyber security applications and management scenarios. Conventional statistical classification techniques rely on the assumption that clean labelled samples are available for building classification models. However, in the big data era, mislabelled training data commonly exist due to the introduction of new applications and lack of knowledge. Existing statistical traffic classification techniques do not address the problem of mislabelled training data, so their performance becomes poor in the presence of mislabelled training data. To meet this challenge, in this paper we propose a new scheme, Noise-resistant Statistical Traffic Classification (NSTC), which incorporates the techniques of noise elimination and reliability estimation into traffic classification. NSTC estimates the reliability of the remaining training data before it builds a robust traffic classifier. Through a number of traffic classification experiments on two real-world traffic data sets, the results show that the new NSTC scheme can effectively address the problem of mislabelled training data. Compared with the state-of-the-art methods, NSTC can significantly improve the classification performance in the context of big unclean data.

Index Terms—Traffic classification, cyber security, machine learning.

I. INTRODUCTION

Traffic classification is a fundamental tool for modern cyber security management [1]. For example, network administrators can apply traffic classification technologies to obtain the current network status, in particular the critical applications, services and user behaviors such as daily usage, anomalous behaviors and so on. Traffic classification is usually used to achieve quality of service (QoS), i.e., various applications are assigned different priorities with appropriate levels of Internet resources. For cyber security, traffic classification can help to quickly detect network intrusions [2] such as Denial of Service attacks [3]. In the last decade, the technology of traffic classification has drawn increasing attention from academia and practitioners. CISCO has incorporated traffic classification technology into its recent network devices. The number of published research papers has increased dramatically since 2005 [5].

With the emergence of more and more new applications, the traditional traffic classification techniques are facing significant challenges. The classic technique identifies the origin application of network traffic according to the port number in the packet header; e.g., port 80 is associated with the HTTP application. The assumption is that each network application uses a distinct port number assigned by IANA. Unfortunately, many applications use dynamic ports, and even other applications' port numbers, for certain reasons, which makes the port-based technique ineffective. To address these problems, the payload-based technique was proposed to apply deep packet inspection (DPI) to identify applications according to specific content patterns in the payload of IP packets, called application signatures [4]. Today most business systems apply the payload-based technique to classify network traffic. However, the payload-based technique cannot deal with applications with encrypted payload, and it raises a privacy issue caused by DPI. Recently, the research community has focused on the statistical traffic classification technique, which does not inspect the content of packet payload. This technique extracts a set of statistical features from traffic flows and employs machine learning (ML) [6] for application identification. In a feature space, a traffic class consists of all traffic flows generated by an application (or a type of applications), and traffic classification becomes a classic multi-class classification problem.

Considering the real-world scenario of big traffic data, this paper addresses a new problem of mislabelled training samples in the area of statistical traffic classification. With more new applications emerging in our daily life, the composition of network traffic becomes more complex than ever. In practice, due to carelessness or lack of knowledge, mislabelled samples will be present in the training data. The existing methods do not consider the presence of such noisy data, so their classification performance is severely compromised. The major contributions of this work are:

• We develop a new system, Noise-resistant Statistical Traffic Classification (NSTC), to address the problem of mislabelled training samples.
• We propose a noise-tolerant method to filter the noisy training samples so that the reliable training samples are kept for training traffic classifiers.
• We present a mathematical proof to justify our approach and set up experiments for performance evaluation.

Performance evaluation of the NSTC scheme is carried out on two real-world Internet traffic data sets. The results show that NSTC significantly outperforms the state-of-the-art traffic classification methods in the context of big unclean data.

The remainder of the paper is structured as follows. Section II reviews related work on statistical traffic classification. Section III presents the details of noise-resistant statistical traffic classification. Section IV reports the experiments and results. Finally, Section V concludes this paper.

Manuscript received * *, *; revised * *, *. This work was supported by the National Natural Science Foundation of China (No. 61401371). (Corresponding authors: Zili Zhang and Jun Zhang.)
B. Wang is with the College of Computer and Information Science & College of Software, Southwest University, Chongqing 400715, China. E-mail: wbf sm1989@163.com.
J. Zhang, L. Pan, and Y. Xiang are with the School of Information Technology, Deakin University, Geelong, VIC 3216, Australia. E-mail: {jun.zhang, l.pan, yang}@deakin.edu.au.
Z. Zhang is with the College of Computer and Information Science & College of Software, Southwest University, Chongqing 400715, China, and the School of Information Technology, Deakin University, Geelong, VIC 3216, Australia. E-mail: zhangzl@swu.edu.cn.
D. Xia is with the School of Information Engineering, Guizhou Minzu University, Guiyang 550025, China. E-mail: gzmy xdw1982@163.com.

II. RELATED WORK

This section presents a review of statistical traffic classification, which can address problems including dynamic ports, encrypted applications and protection of user privacy.

Supervised traffic classification methods produce a decision-making model employing supervised training data. The supervised training data [7] are labelled according to different applications ahead of time. A classifier is trained in the feature space using the training data set and applied to classify new network traffic.

Many classical supervised algorithms have been applied to identify various network applications. These methods generally use sufficient supervised training data. In some early work, Naïve Bayes techniques [8] with kernel estimation and fast correlation-based filtering were applied to leverage statistical features to address the problems from which payload-based traffic classification suffers. To avoid requiring full packets/payloads for classification, Auld and Moore [9] employed the Bayesian neural network technique and used features derived from packet headers to obtain high accuracy. Later, Este et al. [10] presented an approach that applied support vector machine (SVM) techniques to solve multi-class traffic classification and developed an optimization algorithm for the circumstance of few training samples.

For the problem of real-time classification, Hullár et al. [11] used only the first few bytes of the first few packets and employed a Markov model to recognize Peer-to-Peer (P2P) applications. Nguyen et al. [12] combined short sub-flows from the last N packets to improve the performance of real-time classification. Bermolen et al. [13] proposed a new methodology based on SVM to effectively identify P2P streaming applications in a short time. To address the problem that supervised methods are sensitive to the size of the training data, Zhang et al. [14] incorporated correlated information and Bayes theory into the classification process. Glatz et al. [15] proposed a classification scheme that does not require a training procedure. Their results show that the main sources of one-way traffic derive from malicious scanning, peer-to-peer applications and outages. Jin et al. [16] developed a lightweight modular architecture combining a couple of linear binary classifiers to improve the classification performance. Callado et al. [17] introduced a new set of methodologies for generic combination of traffic identification and provided a recommendation for using the combination algorithms. Xie et al. [18] adapted subspace clustering to identify the traffic of each application in isolation and improved the performance of one-class classification.

To reduce redundant features, Liu et al. [36] proposed a class-oriented feature selection approach combining a proposed local metric and an existing global metric. The approach applies the weighted symmetric uncertainty (WSU) strategy to remove the redundant features in each feature subset. Zhang et al. [37] proposed a hybrid feature selection approach, filtering most of the features with the WSU metric and using a wrapper method to select features for a specific classifier with the Area Under ROC Curve (AUC) metric. Furthermore, to overcome the impact of dynamic traffic flows on feature selection, they proposed a "Selects the Robust and Stable Features (SRSF)" algorithm based on the results achieved by WSU AUC. Ambusaidi et al. [38] demonstrated that the removal of redundant features significantly improved the intrusion detection rate. To identify both optimal and stable features, Fahad et al. [39] proposed a Global Optimization Approach (GOA), relying on a multi-criterion fusion-based feature selection technique and an information-theoretic method. To characterise application behaviour at the early stage, Huang et al. [40] developed the statistical attributes of the first few application interaction rounds of each flow from an application-layer perspective and proposed an "APPlication Round method (APPR)" algorithm to identify network application traffic.

To reduce the congestion and the high client-to-relay ratio, Al Sabah et al. [19] proposed to define classes of service for Tor's traffic and map each application class to its appropriate QoS requirement. When the characteristics of the network traffic change, the accuracy of classification will degrade. Wang et al. [20] proposed an adjustable traffic classification system using the ensemble classification technique and a change detection method to improve accuracy with relatively shorter updating time. Grimaudo et al. [21] pushed forward the adoption of behavioral classifiers by engineering a hierarchical classifier that allows proper classification of network traffic into more than twenty fine-grained classes. Nguyen et al. [22] used several statistics derived from sub-flows to achieve automated QoS management and augmented training data sets to maintain timely and continuous traffic classification. Jaber et al. [23] proposed a new online method that combines the statistical and host-based approaches in order to construct a robust and precise method for early Internet traffic identification.

To mitigate the problem that unlabelled traffic samples in the training data set affect the classification performance, some methods were proposed to automatically label the training data. Erman et al. [24] combined unsupervised and supervised methods for identifying applications. They first employed a clustering algorithm to partition a training data set that is composed of scarce labeled flows and abundant unlabelled flows. Second, they used the available labeled flows to obtain a mapping from the clusters to the different known classes. Li et al. [25] applied a semi-supervised SVM method to identify applications of network traffic. Their method only required a few labeled samples and improved the classification performance. Zhang et al. [5] proposed to extract the unknown traffic samples from mass network traffic, and built a robust traffic classifier. Their experiments show a reduced impact of unknown applications in the process of classification. Wang et al. [34] proposed to combine flow clustering with application signatures. To solve the issue of mapping from flow clusters to real applications, Erman et al. [35] integrated a set of supervised training data with unsupervised learning. However, it was still difficult to map a great number of clusters
[Fig. 2: probability density curves p_2(x), p_1(x) and p_0(x).]

4: for j = 1 to |Ti| do
5:     Take a sample xj from Ti
6:     Set L = 0;
7:     for classifier l = 1 to n do
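The algorithm fragment above polls an ensemble of n classifiers for every training sample. A minimal sketch of majority-vote noise filtering in that spirit is given below; the nearest-centroid base classifier, the fold split, and the majority threshold are illustrative assumptions, not the paper's exact algorithm:

```python
from collections import defaultdict

def centroid_classifier(train):
    """Nearest-centroid classifier built from (feature, label) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))

def filter_noise(data, n=3):
    """Majority-vote noise filter in the spirit of the loop above:
    split the data into n folds, train one classifier per fold, and
    drop a sample when most classifiers disagree with its label."""
    folds = [data[i::n] for i in range(n)]
    ensemble = [centroid_classifier(f) for f in folds]
    kept = []
    for x, y in data:
        against = sum(1 for clf in ensemble if clf(x) != y)  # the counter L
        if against <= n // 2:  # majority agrees with the label: keep it
            kept.append((x, y))
    return kept

# Two well-separated classes plus two clearly mislabelled points.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.15, "b"),
        (5.1, "b"), (5.2, "b"), (5.3, "b"), (5.25, "a")]
clean = filter_noise(data)
```

Running the sketch drops the two mislabelled points while keeping the six consistent ones, which is the behaviour the filtering step is designed to achieve.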
The goal of this analysis is to show that we can reduce the classification error rate by removing the mislabelled training data. According to the Bayesian theory, for the distributions of p_0(x) and p_1(x), the optimized decision point is at x_1, where

    p_0(x_1) = p_1(x_1).   (6)

The probability of the minimum classification error is

    P_{e1} = \int_{-\infty}^{x_1} p_0(t)\,dt + \int_{x_1}^{+\infty} p_1(t)\,dt
           = 1 + \Phi\!\left(\frac{x_1 - \mu_0}{\sigma_0}\right) - \Phi\!\left(\frac{x_1 - \mu_1}{\sigma}\right),   (7)

where \Phi(x) is the cumulative distribution function (CDF) of the standard normal distribution,

    \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.   (8)

Similarly, for the distributions of p_0(x) and p_2(x), the optimized decision point is at x_2, where

    p_0(x_2) = p_2(x_2).   (9)

The probability of the minimum classification error is

    P_{e2} = 1 + \Phi\!\left(\frac{x_2 - \mu_0}{\sigma_0}\right) - \Phi\!\left(\frac{x_2 - \mu_2}{\sigma}\right).   (10)

The difference between P_{e1} and P_{e2} is

    P_{e1} - P_{e2} = \left[\Phi\!\left(\frac{x_1 - \mu_0}{\sigma_0}\right) - \Phi\!\left(\frac{x_2 - \mu_0}{\sigma_0}\right)\right] + \left[\Phi\!\left(\frac{x_2 - \mu_2}{\sigma}\right) - \Phi\!\left(\frac{x_1 - \mu_1}{\sigma}\right)\right].   (11)

We would like to show that the difference between P_{e1} and P_{e2} is positive. We can see that (P_{e1} - P_{e2}) has two components according to Equation (11). We then need to show that each part is positive.

Let us check the first part of (P_{e1} - P_{e2}). In the case study, the graph of p_1(x) is closer to p_0(x) than p_2(x), as shown in Fig. 2, so we can obtain the relative position of the two decision points, i.e.,

    x_1 > x_2.   (12)

Considering \sigma_0 > 0, we have

    \frac{x_1 - \mu_0}{\sigma_0} > \frac{x_2 - \mu_0}{\sigma_0}.   (13)

Since the CDF, \Phi(x), is a monotonically increasing function, we obtain

    \Phi\!\left(\frac{x_1 - \mu_0}{\sigma_0}\right) > \Phi\!\left(\frac{x_2 - \mu_0}{\sigma_0}\right).   (14)

The above formula is equivalent to

    \Phi\!\left(\frac{x_1 - \mu_0}{\sigma_0}\right) - \Phi\!\left(\frac{x_2 - \mu_0}{\sigma_0}\right) > 0.   (15)

That is, we have proven that the first part of (P_{e1} - P_{e2}) is positive.

Now, we work on the second part of (P_{e1} - P_{e2}). In this case, as shown in Fig. 2, the graph of p_2(x) can be treated as the horizontal translation of the graph of p_1(x) to the left by (\mu_1 - \mu_2) units. Therefore, there exists \alpha that satisfies the following two equations,

    \alpha - x_2 = \mu_1 - \mu_2,   (16)

    p_1(\alpha) = p_2(x_2).   (17)

When x < \mu_0, p_0(x) is a monotonically increasing function. In this case, both x_1 and x_2 are to the left of \mu_0, so that x_1 < \mu_0 and x_2 < \mu_0 hold. According to Equation (12), x_1 > x_2, so we can obtain

    p_0(x_1) > p_0(x_2).   (18)

From Equations (6), (9) and (18), we get

    p_1(x_1) > p_2(x_2).   (19)

Based on Equations (17) and (19), we have

    p_1(x_1) > p_1(\alpha).   (20)

As shown in Fig. 2, x_1 and \alpha are to the right of \mu_1, i.e., x_1 > \mu_1 and \alpha > \mu_1. When x > \mu_1, p_1(x) is a monotonically decreasing function. As a result, we obtain

    x_1 < \alpha.   (21)

From Equations (16) and (21), we get

    x_1 - x_2 < \mu_1 - \mu_2.   (22)

We then reorganize Equation (22) to obtain

    x_2 - \mu_2 > x_1 - \mu_1.   (23)

Because \sigma > 0, we have

    \frac{x_2 - \mu_2}{\sigma} > \frac{x_1 - \mu_1}{\sigma}.   (24)

Since \Phi(x) is a monotonically increasing function, we obtain

    \Phi\!\left(\frac{x_2 - \mu_2}{\sigma}\right) - \Phi\!\left(\frac{x_1 - \mu_1}{\sigma}\right) > 0.   (25)

Now, we have proven that the second part of (P_{e1} - P_{e2}) is positive.

Finally, from Equations (11), (15) and (25), we obtain

    P_{e1} - P_{e2} > 0,   (26)

that is,

    P_{e2} < P_{e1}.   (27)

This means that the classification error of using NSTC is less than that of the original classification method. The reason is that NSTC can address the problem of mislabelled training data.

IV. PERFORMANCE EVALUATION

To evaluate the correctness of the NSTC method, we choose two network traffic data sets for experiments and compare the performance of the NSTC method with other traffic classification methods. Our results show that the NSTC method is resilient to mislabelled data.
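The claim that removing the more distant (mislabelled) class reduces the minimum classification error, P_{e2} < P_{e1}, can be checked numerically. The sketch below uses illustrative parameters assumed for the example only: p_0 ~ N(0, 1), p_1 ~ N(−2, 1), p_2 ~ N(−3, 1), with equal variances so the decision points are the midpoints of the means.

```python
from math import erf, exp, pi, sqrt

def phi(x):
    """Standard normal CDF, as in Eq. (8)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pdf(x, mu, sigma):
    """Gaussian density."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Illustrative parameters (assumed for this check): p1 is closer to p0 than p2.
mu0, mu1, mu2, sigma0, sigma = 0.0, -2.0, -3.0, 1.0, 1.0

# With equal variances, the decision points where p0(x) = p1(x) and
# p0(x) = p2(x) are the midpoints of the means.
x1 = (mu0 + mu1) / 2  # Eq. (6)
x2 = (mu0 + mu2) / 2  # Eq. (9)
assert abs(pdf(x1, mu0, sigma0) - pdf(x1, mu1, sigma)) < 1e-12
assert x1 > x2        # Eq. (12)

# Minimum classification errors, Eqs. (7) and (10).
Pe1 = 1 + phi((x1 - mu0) / sigma0) - phi((x1 - mu1) / sigma)
Pe2 = 1 + phi((x2 - mu0) / sigma0) - phi((x2 - mu2) / sigma)
assert Pe2 < Pe1      # Eq. (27): the further class is easier to separate
print(round(Pe1, 4), round(Pe2, 4))  # → 0.3173 0.1336
```

The printed values show the error dropping by more than half when the closer class p_1 is replaced by the more distant p_2, consistent with the derivation.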
[Figure: pie charts of class shares; classes include BT, DNS, FTP, HTTP, IMAP, MSN, POP3, RAZOR, SMALL, SMTP, SSH, SSL3 and XMPP.]
Fig. 3. Network class distributions of the two data sets: (a) ToN data set; (b) ISP data set.
[Fig. 4 panels: overall accuracy (%) of SVM, NN, RF and NSTC vs. percentage of mislabelled data (%), for the ToN and ISP data sets.]

[Figure: overall accuracy (%) of SVM, NN, RF and NSTC vs. percentage of training data (%), from 10 to 100 percent.]
Fig. 5. Overall accuracy vs. training size: (a) ToN data set; (b) ISP data set.
performance and per-class performance. To observe the impact of noisy samples, we vary the number of noisy samples with respect to a fixed number of training samples, and the number of training samples with a fixed number of noisy samples, respectively. We report the average results of the 100 iterations of our experiments.

1) Overall Performance: The overall performance is evaluated in terms of average accuracy against various sizes of noise data and training data. Fig. 4 shows that the increasing density of noisy samples negatively affects the overall classification accuracy for the two data sets; Fig. 5 shows that an increasing number of training samples with a fixed number of mislabelled samples improves the overall classification accuracy for the two data sets. According to our experiment results, the NSTC scheme outperforms the well-known classifiers including the support vector machine (SVM), nearest neighbour (NN) and random forest (RF) [5], [14], [43].

In Fig. 4, our proposed NSTC scheme consistently outperforms the other methods when there are increasing portions of noisy samples in the training data. The overall accuracy of the NSTC scheme is higher than that of the second best method, RF, by 3 to 17 percent for the ToN data set and 2 to 7 percent for the ISP data set. For instance, when the training data have 40 percent noise, the overall accuracy of NSTC is higher than that of RF by 10 percent for the ToN data set and by 7 percent for the ISP data set, respectively. RF is the second best method, which is significantly better than SVM and NN. The result shows that the NSTC method can effectively improve the classification accuracy by aggregating the techniques of eliminating and tolerating noise. In addition, the trend shows a decline of the overall accuracy when the noise ratio increases. The gap between the NSTC method and the other methods becomes bigger on the two data sets with the increment of the noise ratio. Our results show that the proposed NSTC method has superior performance in the presence of a high noise percentage.

With respect to different training data sizes, as shown in Fig. 5, the NSTC method is also superior to the other methods on the two data sets. We change the training data size and keep the same ratio of noise by randomly selecting samples from the pre-organized training set. The size varies from 10 to 100 percent of the pre-organised training set. The overall accuracy
[Fig. 6 panels: F-measure (%) of SVM, NN, RF and NSTC for each class vs. percentage of mislabelled data (%).]
of the NSTC method is consistently higher than that of the second best method, RF, by up to 9 percent for the ToN and 10 percent for the ISP data sets, respectively. The RF's performance is much higher than that of SVM and NN. For instance, when the size of the training data reaches 70 percent of the pre-organized training set, the overall accuracy of NSTC is higher than that of the second best method, RF, by approximately 8 percent for the ToN data set and 10 percent for the ISP data set. The results confirm the effectiveness of the proposed NSTC method, which is robust against the change of training data sizes. The overall accuracy of all classifiers increases when more training data are used. In particular, the increase of the training size can immediately improve the overall accuracy of the NSTC method in the presence of unclean training samples. For example, when the training size reaches 60 percent, the overall accuracy of the NSTC method is improved by 15 percent for the ToN data set and by 20 percent for the ISP data set, respectively. On the contrary, there is inconsistent and sometimes little improvement for the other three methods when the training data are increasing.
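The experiments above vary both the noise ratio and the training size. The label-noise injection step used in such a protocol can be sketched as follows; flipping each selected label to a uniformly random different class is an assumption for illustration, as the paper does not spell out its flip strategy:

```python
import random

def inject_label_noise(labels, ratio, rng=None):
    """Return a copy of `labels` in which a `ratio` fraction of the
    entries are flipped to a different, randomly chosen class."""
    rng = rng or random.Random(0)
    classes = sorted(set(labels))
    noisy = list(labels)
    flip = rng.sample(range(len(labels)), int(ratio * len(labels)))
    for i in flip:
        wrong = [c for c in classes if c != labels[i]]  # exclude true label
        noisy[i] = rng.choice(wrong)
    return noisy

# Illustrative flow labels; inject 40 percent label noise.
labels = ["HTTP"] * 50 + ["DNS"] * 30 + ["SSH"] * 20
noisy = inject_label_noise(labels, 0.4)
changed = sum(a != b for a, b in zip(labels, noisy))  # exactly 40 flips
```

Because every selected entry is forced to a different class, the realized noise ratio equals the requested one, which keeps the x-axis of the accuracy-vs-noise plots exact.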
[Fig. 7 panels: F-measure (%) of SVM, NN, RF and NSTC for each class in the ISP data set vs. percentage of mislabelled data (%).]
2) Per-Class Performance: We use F-measure to assess the per-class performance of traffic classification. We compare the performance of the NSTC scheme with the other methods including SVM, NN and RF. To evaluate the impact of mislabelled training samples, we change the noise ratio of the training data set from 20 percent to 50 percent.

Fig. 6 shows the F-measure of each class in the ToN data set. The performance of NSTC is consistently better than that of all three other methods, no matter what the noise ratio is. For example, SSH is the easiest class for the conventional methods, but NSTC can further improve the performance by 4 to 12 percent over the performance of the second best method, RF. FTP is not easy to classify for the conventional methods, but NSTC can improve its F-measure by up to 20 percent. Though the SMALL class could not be identified easily by NSTC, NSTC's performance is still the highest. For the classes of POP3 and SSH, NSTC's F-measure can reach over 90 percent and is not significantly affected by the noise ratio. The results demonstrate that the proposed NSTC method can improve the F-measure of all classes in the ToN data set.
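The F-measure used for the per-class comparison is the harmonic mean of per-class precision and recall. A minimal computation from true and predicted flow labels is sketched below; the toy labels are illustrative only:

```python
def f_measure(y_true, y_pred, cls):
    """Per-class F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative flows: SSH is classified perfectly, FTP only partially.
y_true = ["SSH", "SSH", "SSH", "FTP", "FTP", "HTTP"]
y_pred = ["SSH", "SSH", "SSH", "HTTP", "FTP", "HTTP"]
# f_measure(y_true, y_pred, "SSH") → 1.0
# f_measure(y_true, y_pred, "FTP") → 0.667 (precision 1.0, recall 0.5)
```

Evaluating the metric per class, rather than overall accuracy alone, is what exposes the hard classes such as FTP discussed above.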
That is, the techniques of eliminating and tolerating noise can effectively address the problem of mislabelled training data from the per-class perspective.

Fig. 7 shows the F-measure of each class in the ISP data set. NSTC consistently achieves the highest F-measure for each class. Here, we divide all the network classes into three categories according to the performance of the conventional methods, i.e., easy, average and hard:

• BT, POP3, SMTP and SSH are in the easy category. Although the space for improvement is small, NSTC can further improve the performance by up to 15 percent. For instance, the improvement is 12 percent for POP3 when the training data include 50 percent noise.
• DNS, HTTP, IMAP, MSN, SSL3 and XMPP belong to the average category, where NSTC significantly improves the F-measure. For example, for XMPP, the F-measure of NSTC is higher than that of the second best method, RF, by about 20 percent when the noise ratio is 25 percent.
• FTP is in the hard category, where the F-measure of the conventional methods is much lower than 50 percent. NSTC can still improve its performance by about 5 percent.

Hence, the above results confirm the effectiveness of the proposed NSTC method.

3) Further Evaluation: The use of correlated traffic flows can further improve the traffic classification performance [14]. We are inspired by the idea of Traffic Classification using Correlation (TCC) [14] to incorporate traffic flow correlation into NSTC. To evaluate the effectiveness of the improved NSTC, we conduct a number of tests and compare the results of the improved NSTC and TCC in the presence of unclean training samples.

[Figure: overall accuracy (%) of TCC and the improved NSTC vs. percentage of mislabelled data (%).]
Fig. 8. Overall accuracy of TCC vs. NSTC

Fig. 8 depicts the overall accuracy of NSTC and TCC on the ToN data set. Please note that the results on the ISP data set are similar. The results show that the improved NSTC has much better overall accuracy than TCC. For instance, when the training data contain 30 percent noise, the overall accuracy of NSTC is higher than that of TCC by 10 percent. When the percentage of mislabelled data is increased to 50 percent, TCC's accuracy drops to about 75%, while NSTC's accuracy is still over 80%. The results suggest that TCC struggles to handle mislabelled training data, and that flow correlation can be used to further improve NSTC's performance.

Fig. 10 reports the F-measure of NSTC and TCC on the ToN data set. The F-measure of NSTC significantly outperforms that of TCC in each class, no matter what the noise ratio is. For example, in class RAZOR, NSTC's F-measure is higher than TCC's F-measure by about 20 percent. In class DNS, the difference is about 15 percent. In some classes, such as HTTP, SSL3 and SMTP, TCC's F-measure is very good, but NSTC can further improve it. NSTC delivers consistent and reliable results, which confirms its capability of addressing mislabelled training data.

We also evaluate the classification time of three methods: SVM, NN and NSTC. The comparative results of classification time are listed in Fig. 9.

[Figure: computation time of SVM, NN and NSTC vs. percentage of mislabelled data (%).]
Fig. 9. Computation time

The listed computation time excludes the time used during the preprocessing step when the noisy samples are injected into the training data. In this comparison, we focus on classification time, which is crucial for online traffic classification. As shown in Fig. 9, NSTC is significantly faster than SVM and NN. More specifically, NSTC extends an RF classifier for traffic classification, and the noise ratio does not affect the classification procedure. Therefore, NSTC has high efficiency and is suitable for online traffic classification.

V. CONCLUSION

This paper presented a real-world challenge: network traffic classification struggles to perform well in the presence of mislabelled samples in the training data. That is, when mislabelled traffic samples are present, conventional traffic classification methods cannot sustain their performance. We proposed a novel traffic classification method, Noise-resistant Statistical Traffic Classification (NSTC), which can identify noisy examples and tolerate suspected noisy samples. We provided an empirical and theoretical study to demonstrate the performance benefit of the new NSTC method compared to the existing methods. The experiments and results show that
2332-7790 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2017.2735996, IEEE
Transactions on Big Data
IEEE TRANSACTIONS ON BIG DATA, VOL. 12, NO. 6, Aug 2017 11
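The per-class scores compared throughout Figs. 7 and 10 use the standard F-measure, the harmonic mean of precision and recall for one class. As a minimal sketch of how such a score is computed from true and predicted flow labels (the labels and helper function below are illustrative, not the paper's data or code):

```python
def per_class_f_measure(true_labels, predicted_labels, cls):
    """F-measure for one traffic class: 2*P*R / (P+R),
    where P = TP/(TP+FP) and R = TP/(TP+FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly labelled cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # other flows labelled cls
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # cls flows missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative flow labels only (not the paper's data sets)
true = ["HTTP", "HTTP", "DNS", "FTP", "DNS", "HTTP"]
pred = ["HTTP", "DNS",  "DNS", "FTP", "DNS", "HTTP"]
print(round(per_class_f_measure(true, pred, "HTTP"), 3))  # → 0.8
```

Because it balances precision and recall, a class can show a low F-measure even when overall accuracy is high, which is why the paper reports both metrics.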
[Fig. 10: per-class F-measure panels; x-axis: Percentage of mislabelled data (%), 20–50; y-axis: F-measure (%); curves: TCC, Improved, NSTC]
NSTC delivers consistently superior performance to other traffic classification schemes in the presence of unclean training data.