Sie sind auf Seite 1von 10

BIG DATA MINING AND ANALYTICS

ISSN 2096-0654 04/06 pp 47– 56


Volume 1, Number 1, March 2018
DOI: 10.26599/BDMA.2018.9020005

Online Internet Traffic Monitoring System Using Spark Streaming

Baojun Zhou, Jie Li , Xiaoyan Wang, Yu Gu, Li Xu, Yongqiang Hu, and Lihua Zhu

Abstract: Owing to the explosive growth of Internet traffic, network operators must be able to monitor the entire
network situation and efficiently manage their network resources. Traditional network analysis methods that usually
work on a single machine are no longer suitable for huge traffic data owing to their poor processing ability. Big data
frameworks, such as Hadoop and Spark, can handle such analysis jobs even for a large amount of network traffic.
However, Hadoop and Spark are inherently designed for offline data analysis. To cope with streaming data, various
stream-processing-based frameworks have been proposed, such as Storm, Flink, and Spark Streaming. In this
study, we propose an online Internet traffic monitoring system based on Spark Streaming. The system comprises
three parts, namely, the collector, messaging system, and stream processor. We considered the TCP performance
monitoring as a special use case of showing how network monitoring can be performed with our proposed system.
We conducted typical experiments with a cluster in standalone mode, which showed that our system performs well
for large Internet traffic measurement and monitoring.

Key words: spark streaming; network monitoring; big data; TCP performance monitoring

amount of traffic to be analyzed. According to the


1 Introduction
latest statistics by Cisco[1] , the annual global IP traffic
To provide a secure and well-performing network was 1.2 zettabyte (ZB) in 2016 and expected to reach
for the continually transforming cyberspace, Internet 3.3 ZB by 2021. The rapid growth of traffic volume
operators need to monitor and analyze the network has imposed great challenges for traditional Internet
status in real time. However, this is difficult nowadays monitoring platforms.
due to the huge scalability of networks and the huge Traditionally, Internet traffic measurement and
analysis have been executed on a high-performance
 Baojun Zhou and Jie Li are with the Department of central server[2] . However, due to computing ability
Computer Science, University of Tsukuba, Tsukuba 305-
limitations, central servers cannot cope with large
8577, Japan. E-mail: zhoubaojun@osdp.cs.tsukuba.ac.jp;
lijie@cs.tsukuba.ac.jp. volumes of data in a short period of time. Nowadays,
 Xiaoyan Wang is with the College of Engineering, this renders them unsuitable given the huge amount
Ibaraki University, Hitachi 316-8511, Japan. E-mail: of Internet traffic today. For example, when a DDoS
xiaoyan.wang.shawn@vc.ibaraki.ac.jp. attack occurs, a monitoring system is required in order
 Yu Gu is with the School of Computer and Information, Hefei
to deal with the huge amount of Internet traffic, which
University of Technology, Hefei 230601, China. E-mail:
yugu.bruce@ieee.org. is a tough task for a single server. Various monitoring
 Li Xu is with the College of Mathematics and Computer methods use packet sampling in order to reduce the
Science, Fujian Normal University, Fuzhou 350007, China. E- amount of input data. However, this produces an
mail: xuli@fjnu.edu.cn. inaccurate result[3] . Moreover, a single server makes the
 Yongqiang Hu and Lihua Zhu are with the Institute of Scientific
and Technical Information of Qinghai, Xining 810008, China.
system vulnerable to failures. If the server crashed, we
E-mail: yqhua@163com; zlh97330@163.com. would not be able to recover it quickly without affecting
 To whom correspondence should be addressed. the ongoing task[4] .
Manuscript received: 2017-08-11; accepted: 2017-11-30 In this study, we propose an online Internet traffic
48 Big Data Mining and Analytics, March 2018, 1(1): 47-56

monitoring system based on Spark Streaming, which Network traffic analysis and monitoring can be
is a big data platform that can efficiently process a regarded as a statistics problem for a set of packets in
huge amount of traffic data so that we can monitor the a period of time. Therefore, micro-batching may be a
network status in real time and is robust enough so as to better choice for a network monitoring system. In this
suffer a failure without aborting the entire monitoring study, we propose an online Internet traffic monitoring
process. system based on Spark Streaming. This could be
Big data platforms, such as Hadoop and Spark, used for data traffic performance analysis, such as TCP
provide an efficient way of processing a huge amount performance.
of data. For example, the MapReduce model and Our contributions in this study are as follows:
its open-source version, Hadoop[5] , have been widely  We propose a distributed architecture as an online
adopted by the big data analytics community due to Internet traffic measurement and monitoring
their simplicity and ease of programming[6] . However, system.
the intermediate data of Hadoop are stored on disk  We implement a parallel algorithm for monitoring
(which usually has poor I/O performance); therefore, TCP performance parameters, such as delay and
there will be dramatic performance degradation for retransmission ratio with a very short delay.
algorithms requiring plenty of iterations. To improve  We conduct typical experiments showing that the
the performance of Hadoop, in-memory computing proposed system is feasible and efficient.
methods, such as Apache Spark[7] , have been proposed. The rest of this study is organized as follows. In
The intermediate data in Spark are stored in Resilient Section 2, we introduce various related works on real-
Distributed Datasets (RDD), which are cached in time Internet monitoring. In Section 3, we describe
memory; therefore, data can be processed much faster the architecture of our proposed monitoring system
in comparison with Hadoop[8] . and introduce the tools we used. In Section 4, we
Some offline Internet monitoring systems[2, 9–11] have consider TCP performance analysis as an example of
used big data platforms to improve their processing how the network monitoring problem can be solved
efficiency. However, only a few studies[12, 13] have by the stream-processing method in our System. In
focused on online network monitoring. Section 5, we evaluate the performance of the proposed
Both Hadoop and Spark are based on batch system and present the experimental results. Finally, we
processing, which is suitable for offline data analysis. conclude this study in Section 6.
Batch processing is applied to process large datasets,
2 Related Work
where operations on multiple data items can be batched
for efficiency[14] . This requires input data to be readily Cyberspace is dynamical and vulnerable to attacks.
accessible when the calculation begins so that all of Therefore, it requires network providers to monitor the
the data can be simultaneously processed. Online status of their network in real time. Online Internet
Internet traffic monitoring resembles a stream analytics traffic monitoring technologies have been extensively
problem, where the input is an unbounded sequence of studied. In 1999, Paxson[16] proposed the Bro system to
data. Although MapReduce does not support stream detect network intruders in real time. Bro first captured
processing, it can partially handle streams using a a packet stream using libpcap and then reduced the
technique known as micro-batching. Here the stream incoming stream into a series of higher-level events
is treated as a sequence of small batch data chunks. using an event engine. They also proposed a custom
At short intervals, the incoming stream is packed to scripting language called Bro scripts, which can be
a chunk of data and is delivered to the batch system executed by the policy script interpreter to deal with
for processing[15] . Spark, for example, has provided events. Although Bro is single threaded, it can be set
the Spark Streaming library to support this technique. up in a high throughput cluster environment. Similar
In addition, other platforms exist and are inherently studies include Snort[17] and Suricata[18] , which are
designed for big data streams, such as Apache Storm inherently based on single-machine computing.
and S4, where data are processed through several Various related studies have been conducted on
computing nodes. Each node can process one or more online Internet traffic measurement and monitoring
input stream(s) and generate a set of output streams. using Spark. Gupta et al.[12] used Spark Streaming
Data will be processed as soon as they arrive. to analyze the network in real time. They presented
Baojun Zhou et al.: Online Internet Traffic Monitoring System Using Spark Streaming 49

three network monitoring applications that can be processes the input data transmitted from the collectors.
expressed as streaming analytics problems; namely, First, the collector preprocesses the captured packets
reflection attack monitoring, application performance and only sends necessary protocol header data to the
analysis, and port scan detection. They made use of stream processor to reduce the amount of input data.
programmable switches, such as OpenFlow switches, Moreover, we use a messaging system as a bridge to
to extract only the traffic that was of interest, which help the data transmission from the collectors to the
reduced the data that needed to be processed. However, stream processor.
their system was not feasible for networks using non 3.1 Collector
programmable switches. In this study, we propose an
Internet traffic measurement and monitoring system that To implement the collector component, we used
works on both programmable and non programmable Jpcap, which is an open-source Java library, to capture
switches. Karimi et al.[13] proposed a distributed packets and efficiently extract the required packet
network traffic feature extraction method with Spark header data such as time stamp and TCP segment
for a real time intrusion detection system. They used number. To reduce the number of captured packets, we
a collector component to capture packets from the set a filter (protocol, destination IP address, etc.) for
switch and extract the required information from packet Jpcap so that it captures only the packet that we are
headers. These headers are written in CSV files and interested in.
separated by the time window. Then, Spark periodically 3.2 Messaging system
reads data from the CSV files within a small-time We use Kafka to develop our messaging system. Kafka
window to make it nearly real-time data. However, is a high-performance distributed messaging system for
the periodic writing and reading of files degrades the record streams[19] . The collectors publish the required
performance of Spark as an online Internet traffic packet header data as messages on Kafka using a
monitoring system. Our system uses Spark Streaming common topic. Then, the stream processor subscribes
to directly cope with the stream in order to achieve a to the topic and obtains the message stream from Kafka.
higher speed.
3.3 Stream processor
3 Architecture of the Online Internet Traffic The stream processor component is a cluster running
Monitoring System Spark Streaming. It processes the packet information
collected by the collectors and shows the monitoring
As illustrated by Fig. 1, the proposed online monitoring
results to the operators. This is similar to the Spark
system comprises three components, namely the
program for batch processing, whose logic is expressed
collector, messaging system, and stream processor. A
by RDD transformations. The logic in Spark Streaming
collector is a device that is used to capture packets
is expressed by Discretized Stream (DStream), which
from the network. It captures all the inbound and
is an internal sequence of RDDs, transformations[20] .
outbound packets from a switch using port mirroring.
Spark Streaming has provided multiple transformation
To capture packets from multiple switches, multiple
APIs, such as flatMap(), which maps each input
collectors may be present in the system. The stream
item in the source DStream to 0, or more output
processor is the core component of our system and
items, and groupByKey(), which can group key-
value pairs together according to their keys. Some


  useful transformation APIs are listed in Table 1[20] .


  
 
4 TCP Performance Monitoring



    TCP is a widely used Internet protocol. In this


 
 
 
 
study, we focus on TCP performance monitoring. The
performance monitoring problems of other protocols


    can be dealt with in a similar way. We used our system

 
to calculate the TCP segment throughput, average
Fig. 1 Architecture of the proposed online monitoring Round-Trip Time (RTT), retransmission ratio, and out-
system.
of-order ratio from the Internet traffic stream in real
50 Big Data Mining and Analytics, March 2018, 1(1): 47-56

Table 1 Various useful transformation Spark Streaming is sent to the receiver.


APIs. In normal TCP conversation, the sequence number
Transformation Meaning of SYN/FIN or a segment containing data should not
map() Map each element in the source stream to be smaller than the maximum next expected sequence
a new value. number in previous segments. Otherwise, it must be
flatMap() Similar to map(), but each element can be a segment that is either retransmitted or out-of-order.
mapped to 0 or more output items.
Retransmission means that a segment is considered
mapValues() Map the value of each key-value pair
lost and is therefore resent by the sender; out-of-
without change the key.
reduce() Aggregate each 2 elements in the source
order means that the segments are not received by
stream to 1 new element. the receiver in the order they are sent. In fact, it
reduceByKey() Aggregate 2 key-value pairs with the same is quite difficult to identify whether, in some cases,
key to a new key-value pair. a segment is retransmitted or just out-of-order. Here
groupByKey() Group all key-value pairs with the same we use a simple assumption to determine the out of
key together. order segment. In other words, if the time stamp of
countByValue() Count the frequency of each element, and the segment is very close (within three milliseconds) to
return a key-value pair stream whose key
the segment with the highest next expected sequence
is the element, value is the count.
number, we regard it as an out-of-order segment. A
join() Join two key-value pair streams (K, V)
and (K, W) together, return a new stream high retransmission and out-of-order ratio will cause
of (K, hV, Wi) pairs with all pairs of poor network performance and indicate problems in the
elements for each key. transmission paths; therefore, they are very important
metrics in TCP performance monitoring.
time. In addition, we shall consider an exception
in our retransmission and out-of-order detection;
4.1 Metrics namely, keep-alive. A TCP keep-alive segment is an
acknowledgment segment (ACK) with its sequence
4.1.1 Throughput
number set to one less than the next expected sequence
Our system can measure the number and total length of number and has a payload of 0 or 1 byte. Keep-alive
TCP segments per second. is often used to verify whether the TCP connection
4.1.2 Retransmission and out-of-order with the remote host is still available. Although it
TCP provides an ordered delivery of data. Each may carry 1 byte of data and does not advance the
segment’s protocol header contains a sequence sequence number, it is not a retransmitted or out-of-
number (SEQ), which specifies the position of this order segment. A typical TCP keep-alive is shown in
segment. When a TCP connection is established, both Fig. 2.
the sender and receiver in the connection are assigned 4.1.3 RTT
an initial sequence number. The offset between the RTT is the time required for a network communication
actual sequence number in a segment and the initial one to travel from the source to the destination and back. In
is called the relative sequence number and is regarded TCP, when the receiver successfully receives a segment,
as the number of bytes sent by the sender before this
    
segment.  

     !" 
From the sequence number and payload (data length) 
     !" 
of a segment, we can calculate the next expected  
 

   !" 
sequence number, which indicates the sequence number  

   !" 
 

that the next segment from the host should have. For      !" 

a normal TCP data segment that is not SYN/FIN, the      !" 
 

     !" 
next expected sequence number should equal the sum 
     !" 
of the current sequence number and payload. However,  
 

   !" 
for the SYN/FIN segment, the next expected sequence  

   !" 

number should equal the previous one plus one because


the SYN/FIN information is also considered as data that Fig. 2 Typical TCP keep-alive.
Baojun Zhou et al.: Online Internet Traffic Monitoring System Using Spark Streaming 51

it sends back a response segment that contains an ACK containing the time stamp (divided into a high and
number. The latency between the former TCP segment a low parts), source IP, source port, destination IP,
and the corresponding ACK is measured as RTT. This destination port, SYN/FIN flag, ACK flag, SEQ
is an important network parameter that reflects not number, ACK number, payload, and packet length. We
only network quality but also server response time. denote this input stream DStream0.
The ACK number tells the sender how many bytes Input stream data preprocessing is required to extract
have been successfully received by the receiver, which the information required for the calculation to be
should be equal to the next expected sequence number
performed later. We map each input data as a tuple
of the received TCP segment. In other word, besides
htime stamp, source IP, source port, destination IP,
the reverse IP addresses/ports, a TCP segment and its
destination port, boolean(SYN/FIN, or carry data),
corresponding ACK segment should also satisfy:
boolean(ACK?), sequence number, acknowledgment
SEQ number + payload (in TCP segment) D
number, next expected sequence number, payload, and
ACK number (in ACK segment) (1) packet lengthi. For a segment that is not SYN/FIN and
Especially, for SYN/FIN segment, carries no data, the next expected sequence number is
SEQ number + payload + 1 (in TCP segment) D simply its sequence number. These tuples are stored in
ACK number (in ACK segment) (2) DStream1.
4.2.2 Throughput calculation
We use relationships (1) and (2) to find the
corresponding TCP and ACK segments in the packet Throughput is very easy to calculate by Spark
list. The RTT is calculated as the difference between Streaming. First, we transform each tuple in
their time stamps. However, if there are retransmissions DStream1 to a key-value pair, whose key is hsource
in the TCP conversation, the problem becomes IP, destination IPi and value is hpacket length, 1i. Next,
complicated, such that, when an ACK is received for we use reduceByKey to sum up the values according
a retransmitted segment, we cannot determine whether to the key. Finally, we obtain the total packet length and
the ACK corresponds to the retransmission or the total number of packets for each source and destination
original segment. Even if we decide that the segment IP pair and store them in DStream2.
was lost and retransmitted, it would still be possible 4.2.3 Retransmission and out-of-order statistics
for the segment to eventually arrive after a long time or
We calculate the retransmission and out-of-order
for the segment to arrive quickly while the ACK takes
number. First, we map the tuples in DStream1
a long time to arrive. This is called acknowledgment
to a key-value pair, whose key is hsource IP, source
ambiguity[21] . In our algorithm, we select the original
port, destination IP, destination porti and value is
segment to calculate the RTT. Although it may not be a
real RTT, we think it is more meaningful than the real hboolean(SYN/FIN or contains data?), sequence
one because it shows the delay from the first time of the number, payload, next expected sequence number, time
sender sending the segment to the time of knowing that stampi. We group the key-value pairs according to the
the receiver has received the segment. key and obtain the lists of segments that share the same
source and destination IP addresses/ports. Then we can
4.2 TCP performance monitoring with the
calculate the number of retransmission and out-of-order
proposed system
segments for each list using Algorithm 1. We only count
TCP performance analysis is divided into five steps, retransmission and out-of-order for SYN/FIN segments
namely preprocessing, throughput calculation, or for segments carrying data.
retransmission and out-of-order statistics, RTT We generate a new key-value pair for each list. The
calculation, and sum up. key is a tuple of hsource IP, destination IPi, and the
4.2.1 Preprocessing value is a tuple of hnumber of retransmission, number of
We set the collectors to only capture TCP segments out-of-orderi. Because each list only contains segments
from the Internet. Spark Streaming receives header from a source IP/port to a destination IP/port, the
information from them through sockets. The generated key-value pair only contains the statistics of
information for each segment is a formatted string segments between the specific IP/ports. Therefore we
52 Big Data Mining and Analytics, March 2018, 1(1): 47-56

contains data we generate a key-value pair, whose


$OJRULWKP  *HW VWDWLVWLFV IRU HDFK VHJPHQWV OLVW
key is hsource IP address, source port, destination
'DWD$ OLVW RI VHJPHQWV WKDW KDYH WKH VDPH VRXUFH DQG
GHVWLQDWLRQ ,3SRUW IP address, destination port, next expected sequence
5HVXOW*HW WKH QXPEHU RI UHWUDQVPLVVLRQ DQG numberi and value is htime stamp, true (means it
RXWRIRUGHU is a data segment)i. If it is a segment whose ACK
VRUW  VRUWWKHOLVWLQWLPHRUGHU 
IRUHDFKVHJPHQWLQWKHVHJPHQWVOLVWGR
flag is set to true, we generate a key-value pair, whose
LILVDQ6<1),1RUFDUULHVGDWDDQG6(4  key is hdestination IP address, destination port, source
PD[1H[W6(4WKHQ 6(4QXPEHUPD[ IP address, source port, acknowledgment numberi and
QH[WH[SHFWHG6(4WKHQLWPXVWEH
HLWKHUDNHHSDOLYHUHWUDQVPLVVLRQ
value is htime stamp, false (this means it is an
RURXWRIRUGHU  ACK segment)i. Note that an ACK segment can also
LIQRWNHHSDOLYHWKHQ carry data (piggyback), and a segment can be neither
LIǻ PD[7LPH6WDPSWLPH6WDPS 
a data nor an ACK/FIN/SYN segment; therefore, each
WKHQ ZLWKLQPLOOLVHFRQGV
VLQFHWKHVHJPHQWZLWKWKH segment can finally generate 0, 1, or 2 pair(s) according
KLJKHVWQH[WH[SHFWHG to their type. Because the former TCP segment is
VHTXHQFHQXPEHU  actually a segment sent from the sender to the receiver
RXW2I 2UGHUĸRXW2I 2UGHU
 ZHDVVXPHLWLVDQ and the ACK segment is sent from the receiver to the
RXWRIRUGHU  sender, the keys of the key-value pairs can be regarded
HOVH as hsender IP address, sender port, receiver IP address,
UHWUDQVPLVVLRQĸ 
receiver port, next expected sequence numberi and
UHWUDQVPLVVLRQ   ZH
DVVXPHLWLVD hsender IP address, sender port, receiver IP address,
UHWUDQVPLVVLRQ  receiver port, acknowledgment numberi, respectively.
LIQH[WH[SHFWHG6(4LVWKHPD[RQHHYHUVHHQ WKHQ We use groupByKey() so that the paired
 UHFRUGWKHPD[QH[WH[SHFWHG6(4 TCP/ACK segment can be grouped into the same
DQGLWVWLPHVWDPS  list. Then, we can calculate RTTs from the list using
PD[1H[W6(4ĸ QH[W6(4

Algorithm 2.
PD[7LPH6WDPSĸ WLPH6WDPS

We generate a set of key-value pairs for each sender
UHWXUQQHZNH\YDOXHSDLU 6RXUFH,3'HVWLQDWLRQ,3!
and receiver IP/port pair, whose key is hsender IP,
UHWUDQVPLVVLRQRXW2I2UGHU! 
receiver IPi, value is hRTT, 1i. Then, we use
reduceByKey() to sum up the values by key, obtain
use reduceByKey() to group all the pairs that have
the same source and destination IP and calculate their $OJRULWKP&DOFXODWHWKH577VIURPWKHOLVWRISDLUHG
total number. The group was stored in DStream3. 7&3$&. VHJPHQWV
4.2.4 RTT calculation 'DWD$ OLVW RI SDLUHG 7&3$&. VHJPHQWV
5HVXOW&DOFXODWH WKH 577
According to the relationship of the TCP and ACK pair, VRUW  VRUWWKHOLVWLQWLPHRUGHU 
if we use the tuple of hsource IP address, source port, Iĸ IDOVH  ZKHWKHUWKHUHLVDQHZ7&3
destination IP address, destination port, next expected VHJPHQWIURPVHQGHUWRUHFHLYHUDIWHU
ODVW$&.VHJPHQW 
sequence numberi as a key for the TCP segment and
IRUHDFKVHJPHQW LQ WKH OLVW GR
hdestination IP address, destination port, source IP LILV7&3VHJPHQWWKDWIURPVHQGHUWRUHFHLYHUDQG
address, source port, acknowledgment numberi as a I I DOVH WKHQ
WLPH ĸ 7&3 VHJPHQWಬV WLPH VWDPS
key for the ACK segment, then a TCP segment and its
I ĸ WUXH
corresponding ACK segment should share the same key. HOVHLILVFRUUHVSRQGLQJ$&.VHJPHQWDQGI WUXH
Therefore, we can group all TCP segments and their WKHQ
WLPH ĸ $&. VHJPHQWಬV WLPH VWDPS
corresponding ACK segments in Spark Streaming using
I ĸ I DOVH
groupByKey() and perform the parallel calculation 577 WLPH
ĸ WLPH 
of RTT for each TCP segment. JHQHUDWH QHZ NH\YDOXH SDLU K6HQGHUಬV ,3
First, we map each segment in DStream1 to several 5HFHLYHU¶V,3!577! 
key-value pairs. If it is an SYN/FIN segment or
Baojun Zhou et al.: Online Internet Traffic Monitoring System Using Spark Streaming 53

htotal RTT, total counti for each IP pair, and use The batch interval is set to 1 s, which means that
mapValues to calculate the average RTT, stored in Spark Streaming will start a computation every 1 s and
DStream4. information from all segments within this time interval
4.2.5 Sum up will be collected and processed as a batch task.
In steps 2, 3, and 4, we obtain the packet count, total Our proposed system monitors the packet rate
length, retransmission, out-of-order statistics, and the (packet/s), throughput (byte/s), average RTT (ms),
RTTs for each source IP/port and destination IP/port retransmission ratio (%), and out-of-order ratio (%) for
pair and store them in DStream2, DStream3, and each source and destination IP pair in real time. Figure 4
DStream4, respectively, so that we can use the shows the network performance that was monitored by
join() method to merge these results into one stream. our proposed system. The metrics are presented through
To present the Spark Streaming calculation results, line plots so that the network situation can be clearly
we develop a graphic user interface to show the monitored.
monitored Internet performance metrics for each We investigated the TCP monitoring performance of
specified source and destination IP pair as line charts. the proposed system both in terms of time cost and
robustness.
5 Experimental Result
5.1 Time cost
We used eight servers from Amazon Web Services to We developed a special collector to evaluate the
implement the proposed online TCP performance performance of our system. Unlike conventional
measurement system. Their configurations are collectors that capture packets from Internet in real
presented in Table 2. The structural details are time, this special collector reads packet records from
shown in Fig. 3. a large packet capture (pcap) file.
Currently, in the experimental environment, the Figure 5 shows the performance statistics of our
collector only captures packets from a single computer. stream processor. In this experiment, the average speed
of the input stream was approximately 75 000 records/s.
Table 2 Configurations for each component.
We found that our proposed system processed the input
Component Configuration stream very quickly. Because we set the batch interval
Collector 2 machines. Model t2.micro, 1 GB to 1 s, if the batch task could be processed within the
memory interval, the system would be stable. We found that the
Messaging system 1 machine. Model r4.large, High-
average total delay was 621 ms, which was within the
frequency Intel Xeon E5-2686 v4
(Broadwell) Processors, 15.25 GB
1 s interval. Therefore, it was demonstrated that our
memory. Bandwidth up to 10 Gbps
Stream processor 5 machines. Model c4.large, High-
frequency Intel Xeon E5-2666 v3
(Haswell) processors, 3.75 GB
memory. Bandwidth of 500 Mbps





 








Fig. 3 System structure. Fig. 4 Network performance measured by our system.


54 Big Data Mining and Analytics, March 2018, 1(1): 47-56

,QSXWUDWH UHFRUGVV
,QSXWUDWH UHFRUGVV
 


,QSXWUDWH  ,QSXWUDWH 
$YJUHFRUGVV  $YJUHFRUGVV
 
 
 7LPHOLQH   
7LPHOLQH
6FKHGXOLQJGHOD\ V



6FKHGXOLQJGHOD\ V


 
6FKHGXOLQJGHOD\ 6FKHGXOLQJGHOD\
$YJPV 
$YJPV 

 
   2QHVODYHIDLOHG 6ODYHUHFRYHUHG 
7LPHOLQH
7LPHOLQH

3URFHVVLQJWLPH V

3URFHVVLQJWLPH V



 
3URFHVVLQJWLPH 3URFHVVLQJ7LPH
$YJPV 
$YJPV 

 
   2QHVODYHIDLOHG 6ODYHUHFRYHUHG 
7LPHOLQH
7LPHOLQH

Fig. 5 Performance statistics of stream processor in Spark Fig. 6 System performance changes when a slave crashes.
UI, where processing time is the time taken to process all jobs
for a batch, and scheduling delay is the time to ship the jobs longer suitable. Existing approaches take advantage of
from scheduler to executor. big data frameworks to improve processing efficiency.
However, these approaches mainly focus on offline
system can handle such a high speed input stream.
data analysis. In this study, we proposed an online
We increased the packet rate and found that our
Internet traffic monitoring system that utilizes Spark
system could deal with nearly 150 000 packets per
Streaming. We demonstrated that Internet measurement
second.
and monitoring can be treated as a stream analysis
5.2 Robustness problem and can be handled via a streaming processing
The traditional network monitoring systems process on platform. Extensive experimental results show that our
a single server and abort when the server crashes. In a system achieved good performance and robustness.
cluster that is set up in Spark standalone mode, the jobs In future, we will implement collectors to capture
will be distributed to and completed by several slave packets from switches through port mirroring so
machines. When a slave breaks down, other slaves will that our system can analyze all the traffics passing
resume its remaining jobs without aborting the entire through monitored networks. Finally, we will test
process. its performance in practice and compare it with some
We conducted a typical experiment to investigate traditional single server systems in terms of scalability
the robustness of our proposed system. We manually and reliability.
shut down one slave when the program was running Acknowledgment
and restarted it later. The system status is shown
in Fig. 6. When the slave crashed, its tasks were This work was partially supported by Grant-in-Aid for
Scientific Research from Japan Society for Promotion of
retransmitted to other slaves, and we could observe a
Science (JSPS), Qinghai Joint Research Grant (No. 2016-
jitter in the scheduling delay and processing time. The
HZ-804), and Research Collaboration Grant from NII,
master had to retransmit the jobs to the slave after it
Japan.
was recovered. Therefore, a jitter was also observed
in scheduling delay. Although the failure and recovery References
of a slave impacted the performance of the system, the
[1] Cisco Visual Networking Index, Forecast and
system returned to normal operation very quickly and
methodology, 2016-2021, White Paper, San Jose,
the entire job was not aborted. These experimental CA, USA: Cisco, 2016.
results show that the system is robust. [2] Y. Lee, W. Kang, and H. Son, An Internet traffic analysis
method with MapReduce, in Proc. 2010 IEEE/IFIP
6 Conclusion and Future Work Network Operations and Management Symposium
Workshops (NOMS Wksps), Osaka, Japan, 2010, pp.
With the growth of Internet traffic, traditional network 357–361.
analysis methods that work on single machines are no [3] D. Brauckhoff, B. Tellenbach, A. Wagner, M. May,
Baojun Zhou et al.: Online Internet Traffic Monitoring System Using Spark Streaming 55

and A. Lakhina, Impact of packet sampling on anomaly on Hot Topics in Networks, Atlanta, GA, USA, 2016, pp.
detection metrics, in Proc. 6th ACM SIGCOMM Conf. Int. 106–112.
Measurement, Rio de Janeriro, Brazil, 2006, pp. 159–164. [13] A. M. Karimi, Q. Niyaz, W. Q. Sun, A. Y. Javaid, and
[4] Y. Y. Qiao, Z. M. Lei, L. Yuan, and M. J. Guo, Offline V. K. Devabhaktuni, Distributed network traffic feature
traffic analysis system based on Hadoop, J . China Univ. extraction for a real-time IDS, in Proc.2016 IEEE Int.
Posts Telecommun., vol. 20, no. 5, pp. 97–103, 2013. Conf. Electro Information Technology (EIT), Grand Forks,
[5] Hadoop, http://hadoop.apache.org/, 2017 ND, USA, 2016, pp. 522–526.
[6] K. Kambatla, G. Kollias, V. Kumar, and A. Grama, Trends [14] C. L. P. Chen and C. Y. Zhang, Data-intensive applications,
in big data analytics, J . Parallel Distrib. Comput., vol. 74, challenges, techniques and technologies: A survey on big
no. 7, pp. 2561–2573, 2014. data, Inf. Sci., vol. 275, pp. 314–347, 2014.
[7] Apache Spark, http://spark.apache.org/, 2017. [15] S. Shahrivari, Beyond batch processing: Towards real-time
[8] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and and streaming big data, Computers, vol. 3, no. 4, pp. 117–
I. Stoica, Spark: Cluster computing with working sets, in 129, 2014.
Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing, [16] V. Paxson, Bro: A system for detecting network intruders
Boston, MA, USA, 2010, p. 10. in real-time, Comput. Netw., vol. 31, nos. 23&24, pp.
[9] J. Liu, F. Liu, and N. Ansari, Monitoring and analyzing big 2435–2463, 1999.
traffic data of a large-scale cellular network with Hadoop, [17] M. Roesch, Snort-lightweight intrusion detection
IEEE Netw., vol. 28, no. 4, pp. 32–39, 2014. for networks, in Proc. 13th USENIX Conf. System
[10] Y. Lee and Y. Lee, Toward scalable internet traffic Administration, Seattle, WA, USA, 1999, pp. 229–238.
[18] Suricata, https://suricata-ids.org/, 2017 .
measurement and analysis with Hadoop, ACM SIGCOMM
[19] Kafka performance, https://engineering.linkedin.com/kafka/
Comput. Commun. Rev., vol. 43, no. 1, pp. 5–13, 2013. benchmarking-apache-kafka-2-million-writessecond-
[11] Z. J. Chen, G. B. Xu, V. Mahalingam, L. Q. Ge, J.
three-cheap-machines, 2017.
Nguyen, W. Yu, and C. Lu, A cloud computing based [20] Spark Streaming, http://spark.apache.org/docs/latest/
network monitoring and threat detection system for critical streaming-programming-guide.html, 2017.
infrastructures, Big Data Res., vol. 3, pp. 10–23, 2016. [21] Acknowledgment ambiguity, http://www.tcpipguide.com/
[12] A. Gupta, R. Birkner, M. Canini, N. Feamster, C. Mac- free/tTCPAdaptiveRetransmissionandRetransmissionTimer
Stoker, and W. Willinger, Network monitoring as a Cal-2.htm, 2017.
streaming analytics problem, in Proc. 15th ACM Workshop

Baojun Zhou received the BE degree is a senior member of IEEE and ACM and a member of IPSJ
from Huazhong University of Science and (Information Processing Society of Japan). He serves as an
Technology, Wuhan, China, in 2014. He is associated editor for many international journals and transactions
currently working toward the ME degree at such as IEEE IoT Journal, IEEE Access Journal, IEEE Trans on
the University of Tsukuba, Tsukuba, Japan, TVT, and IEEE Trans on Cloud Computing. He is the founding
in the Department of Computer Science. Chair of Technical Committee on Big Data (TCBD), IEEE
His research interests include big data, ComSoc. He has also served on the program committees for
machine learning, and network security. several international conferences such as IEEE INFOCOM,
IEEE GLOBECOM, and IEEE MASS.
Jie Li received the BE degree from
Zhejiang University, Hangzhou, Xiaoyan Wang received the BE degree
China, the ME degree in electronic from Beihang University, China, and the
engineering and communication systems ME and PhD degrees from the University
from China Academy of Posts and of Tsukuba, Japan, in 2010 and 2013,
Telecommunications, Beijing, China, in respectively. He is currently working as
1982 and 1985, respectively. He received an assistant professor with the College of
the Dr. Eng. degree from the University Engineering at Ibaraki University, Japan.
of Electro-Communications, Tokyo, Japan in 1993. He is with Before that, he worked as an assistant
Faculty of Engineering, Information and Systems, University of professor at National Institute of Informatics (NII), Japan, from
Tsukuba, Japan, where he is a professor. His current research 2013 to 2016. His research interests include wireless networks,
interests are in mobile distributed computing and networking, network security, game theory, and machine learning.
big data and cloud computing, network security, OS, modeling
and performance evaluation of information systems. He was
a visiting professor in Yale University, USA, Inria Sophia
Antipolis, France, and Inria Grenoble Rhone-Alpes, France. He
56 Big Data Mining and Analytics, March 2018, 1(1): 47-56

Li Xu is a professor and doctoral Yu Gu received the BE degree from


supervisor at the school of Mathematics University of Science and Technology of
and Computer Science at Fujian Normal China (USTC) in 2004. He received the
University. He received the BS and MS PhD degree from the same university in
degrees from Fujian Normal University in 2010. From 2006. 2 to 2006. 8, he was
1992 and 2001, respectively. He received an intern in the Microsoft Research Asia,
the PhD degree from Nanjing University Beijing, China. From 2007 to 2008, he
of Posts and Telecommunications in 2004. was visiting scholar in the University of
Now he is the director of office of informatization in Fujian Tsukuba, Japan. From 2010 to 2012, he worked in the National
Normal University and Director of Key Lab of Network Institute of Informatics (Japan) as a JSPS Research Fellow. Now
Security and Cryptography in Fujian Province. His interests he is a professor in School of Computer and Information, Hefei
include network optimization and network security, complex University of Technology, China. His research interests include
network and system, wireless communication network and pervasive computing and affective computing. He is a senior
communication, etc. Prof. Xu has been invited to act as PC member of IEEE. He received the Excellent Paper Award in the
chair or member at more than 30 international conferences. He IEEE Scalcom 2009.
is a member of IEEE and ACM, and a senior member of CCF
and CIE in China. He has published over 150 papers in refereed Yongqiang Hu received the BE degree
journals and conferences. from Northwest A&F University,
Yangling, China, in 1991. Now he is
Lihua Zhu received the BE degree from the director of Institute of Scientific
Qinghai University of Economics and and Technical Information of Qinghai
Management, Xining, China, in 1998. Province, China. His current research
Now she is with Institute of Scientific interests are in science and technology
and Technical Information of Qinghai information platform, big data analysis,
Province, China, where she is an associate rural information platform, information technology theory,
research fellow of Library Science. Her science and technology strategy.
current interests are in technical archives
development, soft science, science and technology strategic
development, science and technology information methodology.

Das könnte Ihnen auch gefallen