Beruflich Dokumente
Kultur Dokumente
1 Introduction
Advanced Persistent Threat (APT) attack is an attack method for the purpose
of theft of confidential information or destruction of system about a particular
organization or company. APT attack usually uses malware called Remote Access
Trojan (RAT) which can steal the confidential information from a target orga-
nization. After RAT’s intrusion through APT attacks, the attacker can monitor
and control the victim’s PC remotely, to wait for an opportunity to steal the con-
fidential information. There are three main intrusion ways of RATs: Email, USB
memory and Drive-by-download attack. It is difficult for even an administrator
to perceive such attacks [4]. In APT attacks the conventional entrance counter-
measure is difficult to detect since the security software can seldom detect RATs.
Due to an increasing number of advanced attacks that cannot be prevented by
only entrance measures, the exit measures have become important.
c Springer International Publishing AG 2016
F. Bao et al. (Eds.): ISPEC 2016, LNCS 10060, pp. 110–121, 2016.
DOI: 10.1007/978-3-319-49151-6 8
A Host-Based Detection Method of Remote Access Trojan 111
2 Related Work
a method to detect the malware of APT attack using the system behavior of the
normal program. Chandran et al. [1] propose a method to detect APT malware
by focusing on “changes” in the behavior on a host. It uses information such
as CPU usage, memory usage, the number of files in system32 folder and open
ports. Mimura and Sasaki [8] propose a method to log suspicious communications
of RATs using the communication and process information.
There are several network-based RAT detection methods [5,6,10,11]. These
methods focus attention on the difference in the communication characteristic of
RATs and normal applications. Jiang and Omote [5] propose a RAT detection
method which uses seven network features and five kinds of supervised machine
learning algorithms in the early stage. However, FNR of this method is slightly
high. Li et al. [6] propose a RAT detection method which uses the network infor-
mation from a SYN packet to a FIN/RST packet to extract network behavior
features. However, sensitive information may be leaked at the end of TCP net-
work connection. Yamada et al. [11] detects illegal activities of RAT spying in
the Intranet. Yamauchi et al. [10] propose a method of detecting C&C traffic in
Bot-nets using the characteristics of standard deviation of access interval time.
There are also some studies of RAT detection methods which combine Host- and
Network-based approaches [3,12].
3 Preliminary
Machine learning is largely divided into the supervised and unsupervised learn-
ing. The supervised learning uses the correct input-output pairs as training data.
The purpose of supervised learning is to obtain a correct output against the input
data. On the other hand, the purpose of unsupervised learning is to find the reg-
ularity from input data. Our research has a classification advantage that the
answer exists, hence we use the supervised learning algorithms.
3.3 Cross-Validation
We use the K-Fold Cross validation (CV) technique to evaluate the detection
model created by the machine learning techniques. The advantage using K-
Fold CV is to evaluate the detection of RATs which are not used for training.
A Host-Based Detection Method of Remote Access Trojan 113
Fig. 1. Image of traffic data size in the network traffic of the early stage [5].
Accordingly, this evaluation considers the detection of unknown RATs. The eval-
uation parameters Accuracy, FPR and FNR are calculated based on the following
formula:
True Detection Number
Accuracy = (1)
Total Number
False Detection Number of ‘Positive’
FPR = (2)
Legitimate Sample Number
False Detection Number of ‘Negative’
FNR = (3)
RAT Sample Number
Definition 1 (Early Stage [5]). The Early Stage of a session is a packet list
which satisfies conditions as follows:
Feature Description
PacNum Packet number Per session
OutByte Outbound data size
InByte Inbound data size
OutPac Outobound packet number
InPac Inbound packet number
O/Ibyte Rate of OutByteInByte
O/Ipac Rate of OutPacInPac
OB/OP Rate of OutByteOutPac
IB/IP Rate of InByteInPac
DstIP Destination IP number Per process
Conn Connection number
The traffic of normal applications is usually greater than that of RATs during
the early stage. In other words, RATs always trend to behave secretly in order to
hide themselves in the Intranet as long as possible. The reason is that the large
amount of traffic can be easier to be discovered by some usual countermeasures.
Comparing with RATs, normal applications do not need to hide their network
behavior. Moreover, normal applications leverage multi-thread techniques in pur-
suit of a high communicate speed. Figure 1 shows the image of traffic data size
in both directions during the early stage.
Our method is composed of three steps; (1) the feature extraction phase, (2)
the learning phase, and (3) the detection phase. The feature extraction phase
collects network features from packets in each process on a host during the
early stage, and then calculates the feature vectors. In the learning phase, each
vector is marked as normal or RAT in order to train the detection model by the
supervised machine learning algorithms. Finally, new packets will be classified
into a normal or RAT class based on the detection model in the detection phase.
The following describes the details of the three phases.
We define a pair of the source and destination IP addresses as one session,
and also define a set of source IP address, source port number, destination IP
address, and destination port number as one connection as described in [5,6].
1. Initialization
(PacNum=OutByte=OutPac=InByte
=InPac=DstIP=Conn=0 and
O/Ibyte=O/Ipac=OB/OP=IB/IP=NULL)
3. Process Identification
4. Session Identification
of such 11 features. The first 9 kinds of features are obtained in each session and
the rest of 2 kinds of features are obtained in each process. We can collect them
from the first 58 bytes of TCP packets in the early stage. The feature vector,
which uses the network features in each process, has a 11-dimensional one as
described in Table 1. We deal with the session in each process by linking the
process and session, in which a process can have multiple-session. Our method
assumes that the process is identifiable on a host.
Figure 2 shows the calculation algorithm of feature vector in feature extrac-
tion phase. DstIP and Conn are calculated through the entire running processes
and the rest of features are calculated in each session. Our method uses the
total number of DstIP and Conn in the process. The calculation steps of feature
vector is as follows.
7. If the interval time between this packet and the previous packet in the certain
session exceeds the threshold t, it goes to no. 8 to terminate the early stages.
Otherwise, it repeats from no. 2 to no. 7 during the early stage.
8. Calculate O/Ibyte, O/Ipac, OB/OP and IB/IP of the relevant session after
the early stage is finished.
9. Generate a feature vector for the session using the above calculated features.
This can be executed for all running sessions at once in batches. We assume that
one process can generate plural sessions for network communication.
(2) Learning Phase. We construct a detection model using the feature vectors
extracted by the feature extraction phase. Our detection model learns the fea-
ture vectors of normal applications and RATs by using the supervised machine
learning algorithms. We add normal/abnormal labels to the feature vector for
learning, where labels have a value of 0 or 1. The normal application stands
for the label of 0, and RAT stands for the label of 1. This method can clas-
sify whether the target communication is normal application or RAT by our
detection model. The final output of this phase is a detection model.
A Host-Based Detection Method of Remote Access Trojan 117
(3) Detection Phase. Our method generates a feature vector from the present
communication and system information on a host. Then, the detection model
predicts the label of new session. The input is a feature vector and the output is 0
or 1. If the output from the detection model is 0, then the target session is judged
as a normal communication. Otherwise, it is judged as RAT communication.
5 Evaluation
5.1 Purpose
We evaluate the performance of our proposed RAT detection method. In our
experimental evaluation, we perform 5-Fold cross-validation using six machine
learning algorithms, and verify whether our method is effective for RAT detec-
tion. We also evaluate the effective feature for RAT detection in the early stage.
Name PacNum OutByte InByte OutPac InPac O/Ibyte O/Ipac OB/OP IB/IP DesIP Conn
Nuclear 7 343 170 4 3 2.02 1.33 85.75 56.67 1 1
BitTorrent 21 3688 240 17 4 15.37 4.25 216.94 60.00 9 10
5.3 Procedure
Preprocess. The traces of RAT samples are executed in a closed network envi-
ronment. On the other hand, the traces of the normal application samples are
collected on our laboratory network. We collect the process and network infor-
mation of RATs and the normal applications on a host. Our method finds only
the process IDs which communicate with the externals. This means that the
process which does not communicate is disregarded in our evaluation. 20 RATs
have 20 sessions in 20 process and 12 normal applications have 38 sessions in 12
processes, and thus we collect the data of 58 sessions in combined 32 processes.
Time
6 Discussion
6.1 Preliminary Experiment
We performed the preliminary experiments to determine the threshold of the
early stage. More precisely, it performed an experiment changing the threshold
value from t = 1 to 25, using Random Forest as a representative algorithm. The
results of this experiment are shown in Figs. 3 and 4. We found that t = 1 is the
best from the viewpoints of Accuracy and FNR. We use one second as an early
stage time in the experiments of our method.
FPR
FNR
Time
On the other hand, the accuracy of our method is 96.5 % together with FNR of
0.0 % and FPR of 5.4 % by Naive Bayes. Although the employed algorithms are
different, we find that FNR has fallen considerably with high accuracy. There-
fore, we succeeded in lowering FNR by newly using the process information on
a host.
6.3 Evasion
If all the features could be evaded by the malware authors with no cost or
no risk, the method’s effectiveness should be questioned. If RAT behaves like
normal communication in the early stage, it may be able to evade the detection
by our method. This is a limitation of our method. However, such customized
RAT does not have the inherent feature that a RAT tries to hide its own trace
of communication. This is a disadvantage for RAT since its trace or evidence
increases. It may make the detection by the other approaches easy. RATs behave
as secretly as possible so that it cannot be found by a network administrator or
the user. We guess that RATs do not try to behave like normal communication
and tend to hide its own trace or evidence of communication in the early stage.
7 Conclusion
In this study, we proposed a host-based RAT detection method using the features
of the early stage. We performed experiments using six kinds of machine learning
algorithms. In the results of our experiment, we obtained the detection accuracy
of 96.5 % together with FNR of 0.0 % and FPR of 5.4 % by Naive Bayes (NB).
Thus, compared with our previous network-based method, our proposed method
has succeeded in lowering FNR and FPR with high accuracy. Detection in the
early stage was showed to be effective from the experimental results. Future work
will include increasing the number of RAT samples in order to properly evaluate
our method.
A Host-Based Detection Method of Remote Access Trojan 121
References
1. Chandran, S., Hrudya, P., Poornachandran, P.: An efficient classification model
for detecting advanced persistent threat. In: The International Conference on
Advances in Computing, Communications and Informations (ICACCI 2015), pp.
2001–2009 (2015)
2. Das, N., Sarkar, T.: Survey on host and network based intrusion detection system.
Int. J. Adv. Netw. Appl. 6(2), 2266–2269 (2014)
3. Friedberg, I., Skopik, F., Settanni, G., Fiedler, R.: Combating advanced persistent
threats: from network event correlation to incident detection. Comput. Secur. 48,
35–57 (2015)
4. Information-Technology Promotion Agency, Japan, “10 Major Security Threats
2015” (2015)
5. Jiang, D., Omote, K.: A RAT detection method based on network behaviors of the
communication’s early stage. IEICE Trans. Fundam. E99–A(1), 145–153 (2016)
6. Li, S., Yun, X., Zhang, Y., Xiao, J., Wang, Y.: A general framework of Trojan
communication detection based on network traces. In: The 7th International Con-
ference on Networking, Architecture and Storage (NAS 2012), pp. 49–58 (2012)
7. Moon, D., Pan, S.B., Kim, I.: Host-based intrusion detection system for secure
human-centric computing. J. Supercomput. 72(7), 2520–2536 (2015)
8. Mimura, S., Sasaki, R.: Method for estimating unjust communication cause using
network packets associated with process information. In: The International Con-
ference on Information Security and Cyber Forensics (InfoSec 2014) (2014)
9. Liang, Y., Peng, G., Zhang, H., Wang, Y.: An unknown Trojan detection method
based on software network behavior. Wuhan Univ. J. Nat. Sci. 18(5), 369–376
(2013)
10. Yamauchi, K., Kawamoto, J., Hori, Y., Sakurai, K.: Extracting C&C traffic by
session classification using machine learning. In: The 7th Workshop Among Asian
Information Security Labs (WAIS) (2014)
11. Yamada, M., Morinaga, M., Unno, Y., Torii, S., Takenaka, M.: RAT-based mali-
cious activities detection on enterprise internal networks. In: The 10th International
Conference for Internet Technology and Secured Transactions (ICITST 2015), pp.
321–325 (2015)
12. Zeng, Y., Hu, X., Shin, K.G.: Detection of botnets using combined host- and
network-level information. In: IEEE/IFIP International Conference on Depend-
able Systems and Networks (DSN 2010), pp. 291–300 (2010)