

Evaluating Machine Learning Models for Android Malware Detection: A Comparison Study

Conference Paper · December 2018


DOI: 10.1145/3301326.3301390


3 authors:

Md. Shohel Rana, University of Southern Mississippi
Charan Gudla, University of Southern Mississippi
Andrew H. Sung, University of Southern Mississippi


All content following this page was uploaded by Md. Shohel Rana on 20 March 2019.



Evaluating Machine Learning Models for Android Malware Detection: A Comparison Study

Md. Shohel Rana
Ph.D. Student
School of Computing Sciences and Computer Engineering
University of Southern Mississippi
Hattiesburg, MS 39406, United States
+1 (929) 331-7300
md.rana@usm.edu

Charan Gudla
Ph.D. Student
School of Computing Sciences and Computer Engineering
University of Southern Mississippi
Hattiesburg, MS 39406, United States
+1 (804) 928-4768
charan.gudla@usm.edu

Andrew H. Sung
Professor
School of Computing Sciences and Computer Engineering
University of Southern Mississippi
Hattiesburg, MS 39406, United States
+1 (714) 454-6203
andrew.sung@usm.edu

ABSTRACT
Android is the most popular mobile operating system, with billions of active users worldwide, which has attracted advertisers, hackers, and cybercriminals who develop malware for various purposes. In recent years, extensive research has been conducted on malware analysis and detection for Android devices, while Android has also implemented various security controls to deal with malware, including a unique user ID (UID) for each application, system permissions, and its distribution platform, Google Play. In this paper, we optimize and evaluate different types of machine learning algorithms by implementing a classifier based on static analysis in order to detect malware in applications running on the Android OS. In our evaluation, we use 11,120 applications from the DREBIN dataset, with 5,560 malware samples and 5,560 benign samples, and the accuracy we achieve is higher than 94%; the study therefore demonstrates the effectiveness of machine learning classifiers for detecting Android malware.

CCS Concepts
• Computing methodologies → Machine learning.

Keywords
Machine Learning, Malware, Classifier, Optimization, DREBIN, Google Play.

1. INTRODUCTION
Android is an open source mobile operating system built on the Linux kernel. Its architecture is divided into five components with two permission models: (i) a sandbox environment at the kernel level, which prevents access to the file system and other resources, and (ii) an API that is exposed to the user during installation of an application [1].

Every Android application is assembled from application code, resources, and an AndroidManifest.xml file, which provides information about the application's features and security configuration, such as the permissions API, activities, services, content providers, and broadcast receivers [2]. We studied the AndroidManifest.xml file to check the permissions used; API functions were then written in Java to check, after decompiling an Android APK file, whether any .dex executable (ELF) image file or any code-hiding image script is present. During the last couple of years, a large number of machine learning methods have been applied to Android devices to classify malware into families and to identify new malware families using clustering and classification techniques.

In this paper, we evaluate different types of machine learning methods for solving the classification problem of detecting malware on Android devices by performing static analysis on the DREBIN dataset [3]. The dataset is composed of malware and benign data, and each file contains numerous features such as requested hardware components, requested permissions, app components, network addresses, and so on. The results of our experiment are assessed based on various features, including api_call, feature, url, service_receiver, permission, call, intent, real_permission, activity, and provider. Analyzing the outcomes of this experimentation, we found that the ensemble-based learning algorithms (e.g., Random Forest, Decision Tree, Extremely Randomized Tree) performed better, with the best obtaining more than 94% accuracy in classification. Finally, the outcomes of this experiment can guide the design and development of new real-time applications or tools for detecting malicious apps on Android devices.

The paper is organized as follows: Section 2 gives an overview of mobile malware and threat analysis, Section 3 describes related work, Section 4 describes the machine learning algorithms used in our experiment, Section 5 describes the methodology, including the dataset description, technology used, and training and testing, Section 6 presents the results and analysis, including measurement metrics and performance results, and Section 7 presents the conclusion and future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICNCC 2018, December 14–16, 2018, Taipei City, Taiwan
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6553-6/18/12…$15.00
https://doi.org/10.1145/3301326.3301390

2. OVERVIEW OF MOBILE MALWARE AND THREAT ANALYSIS
According to their malicious behaviors, mobile malware can be separated into numerous classes. The first approach automatically

installs malware on mobile devices using techniques such as worms, whereas the second approach relies on drawing the user's attention so that they install apps (e.g., adware) manually. Table 1 lists some types of mobile malware according to their malicious behaviors [4].

In addition, we briefly describe some mobile malware attack types and their intent, in order to extract useful features that can serve as indicators during malware detection [5]-[6].

Hardware-based attacks: In hardware-based attacks, attackers use specific commands or operations to crash hardware, insert firmware, or change hardware so that it behaves abnormally or even maliciously. This is normally done by creating a backdoor, intercepting data, manipulating hardware, inserting firmware, or cloning hardware and services.

Software-based attacks: In software-based attacks, attackers upload Android apps to third-party mobile app markets for download, embedding malware in official and benign apps. This is normally done to intercept data, create a botnet, or gain pecuniary benefits.

Firmware-based attacks: In firmware-based attacks, attackers modify the programs of devices by obtaining control privileges or creating backdoors so that they can crash or take control of the system.

Table 1. Mobile malware based on malicious behaviors

Category         Threats
Adware           Bundled with unknown software via pop-up ads, or showing commercial advertisements without the permission of users
Scareware        An app that poses as a fake antivirus, misleading a user into paying for or downloading content from networks
Mobile Spyware   Spies on any actions of mobile device users
Mobile Trojans   Trojans adapted to the mobile ecosystem to achieve malicious mobile-based goals, such as Banking Trojans and SMS Trojans
Mobile Virus     Propagates malicious programs by adapting to a mobile cellular environment
Mobile Worm      Exploits a weakness of an app or a system; the Cabir worm, for example, spreads using the Bluetooth feature of wireless phones

3. RELATED WORKS
A study is essential for selecting the most appropriate machine learning algorithm for a given problem, because not all algorithms work efficiently to classify an app as malicious or benign (see Figure 1).

Figure 1. Overview of malware detection techniques

Recent studies of the use of machine learning algorithms with static and dynamic analysis for malware detection include the following.

In [7], the authors evaluated several supervised machine learning algorithms by implementing a static analysis framework to make predictions for detecting malware on Android.

Rana, et al. [8] proposed a new substring-based feature selection method by evaluating four tree-based machine learning algorithms for detecting Android malware. In the experiment, 11,120 apps of the DREBIN dataset were used, of which 5,560 are malware samples and the rest are benign. The Random Forest classifier, with 97.24% accuracy, outperformed the best previously reported result (around 94% accuracy, obtained by SVM).

In still another paper [9], a stacked generalization concept on tree-based machine learning algorithms was implemented for detecting malware on Android, in conjunction with a substring-based method for training the algorithms using the DREBIN dataset; it obtained 98% accuracy, which was better than previous experimental results.

Talha, et al. [10] proposed a permission-based detection system, 'APK Auditor', to classify Android apps as benign or malicious and obtained 88% accuracy with a specificity of 0.925 using 8762 applications, comprising 1853 benign applications and 6909 malicious applications.

Sahs and Khan [11] proposed a supervised machine learning technique to detect malware on Android using SVM, in which the 'Androguard' tool and the Scikit-learn framework are first used to extract information from the APKs [12].

DroidDolphin [13] is a dynamic analysis framework based on machine learning to detect malware on Android. It performs analysis by extracting information from API calls and 13 activities while running the application in virtual environments, and achieved a precision of 86.1% and an F-score of 0.875 using SVM and the LIBSVM library [14].

4. MACHINE LEARNING ALGORITHMS
This section describes the machine learning algorithms used in our experiment:
• Decision Tree (DT): Builds classification models in the form of a tree structure by breaking down a dataset (categorical and/or numerical data) into smaller subsets. The final result is a tree with decision nodes, each having two or more branches, and leaf nodes, each representing a classification or decision; the topmost decision node, the root node, corresponds to the best predictor [15].
• Random Forest (RF): A supervised learning algorithm trained with the bagging method that builds multiple decision trees and merges them together to get a more accurate and stable prediction [16].
• Extremely Randomized Tree (ERT): Compared with random forests, this method drops the idea of using bootstrap copies of the learning sample and, instead of trying to find an optimal cut-point for each of the K randomly chosen features at each node, selects a cut-point at random [17].
• Gradient Boosted (GB): An ensemble machine learning algorithm for both regression and classification problems that uses boosting, combining a number of weak learners to form a strong learner [18].
• Support Vector Machine (SVM): A discriminative classifier defined by a separating hyperplane; the algorithm outputs

an optimal hyperplane that categorizes new examples. In 2D space, this hyperplane is a line dividing the plane into two parts, with each class lying on either side [19].
• Neural Networks (NN-MLP): A feedforward artificial neural network model that consists of multiple layers, where each layer is fully connected to the next and maps sets of input data onto a set of appropriate outputs. The nodes of the layers are neurons using nonlinear activation functions, except for the nodes of the input layer. There can be one or more nonlinear hidden layers between the input and the output layer [20].
• Naïve Bayes (NB): A supervised classification algorithm for solving binary or multi-class classification problems for predictive modeling by calculating the probabilities for each factor and then selecting the outcome with the highest probability using Bayes' theorem [21].
• k-Nearest Neighbors (k-NN): In a classification task, the output is the class with the highest frequency among the K most similar instances: each instance votes for its class, and the class with the most votes is taken as the prediction. The class probabilities are calculated as the normalized frequency of samples belonging to each class in the set of K most similar instances for a new data instance [22].
• Discriminant Analysis (DA): Discovers a set of prediction equations based on independent variables that are used to classify entities into groups, with two possible objectives: (i) finding a predictive equation for classifying new individuals, and (ii) interpreting the predictive equation to understand the relationships that exist among variables [23].
• Logistic Regression (LR): A regression analysis appropriate when the dependent variable (target) is categorical (e.g., predicting whether an email is spam, 1 or 0). It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables [24].
• Bagging (BAGG): An ensemble-based learning method that combines the predictions from multiple machine learning algorithms to make more accurate predictions than any individual model [25].
• K-Means (KMN): The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset [26].

5. METHODOLOGY
5.1 Dataset Description
In our experiment we used 11,120 of the 123,453 real Android applications in the 'DREBIN' dataset, of which 5,560 are real malware samples from 179 different malware families and 5,560 are real benign samples. The samples were collected between August 2010 and October 2012. The dataset includes numerous feature families: api_call, feature, url, service_receiver, permission, call, intent, real_permission, activity, and provider. An overview of the top 20 malware families in our dataset is provided in Table 2. Note that only the top 20 families are shown, that is, those with more than 40 entries.

Table 2. Top malware families of dataset (values above 40)

Malware family   #Entries    Malware family       #Entries
FakeInstaller    925         Adrd                 91
DroidKungFu      667         DroidDream           81
Plankton         625         ExploitLinuxLotoor   70
Opfake           613         Glodream             69
Ginmaster        339         MobileTx             69
BaseBridge       330         FakeRun              61
Iconosys         152         SendPay              59
Kwin             147         Gappusin             58
FakeDoc          132         Imlog                43
Geinimi          92          SMSreg               41

5.2 Technology Used
The experiment was carried out using Python in the Anaconda distribution. The machine configuration was: System Model: MacBookPro11,5; Processor: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz (8 CPUs), ~2.5GHz, 64-bit; RAM: 16GB.

5.3 Training and Testing
To obtain good results, we divide the dataset into a training set and a testing set and perform the classification task by applying the Decision Tree, Random Forest, Extremely Randomized Tree, Gradient Boosting, Bagging, Support Vector Machine, Neural Networks, Naïve Bayes, k-Nearest Neighbors, K-Means, Discriminant Analysis, and Logistic Regression classifiers. To train a classifier, we provide the feature vectors obtained from the previous computation along with the class labels of each training instance. After training the classification model, we evaluate its accuracy by comparing its classifications of new instances with the target classifications for the same instances.

6. RESULTS AND ANALYSIS
6.1 Measurement Metrics
Confusion Matrix: A confusion matrix (see Table 3) contains information about actual and predicted classifications, which is used to measure the performance of an algorithm [27].

Table 3. Confusion Matrix

                               Actual Class
                               Positive               Negative
Predicted Class   Positive     True Positive (TP)     False Positive (FP)
                  Negative     False Negative (FN)    True Negative (TN)

Accuracy (AC) is the proportion of the total number of predictions that are correct, that is, how often the classifier is correct overall:

\[ AC = \frac{TP + TN}{TP + TN + FP + FN} \]
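As a small worked illustration of Table 3 and the accuracy definition, the following sketch counts the four confusion-matrix cells from a list of actual and predicted labels and derives AC from them. The label encoding (1 = malware, 0 = benign), the function names, and the example labels are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN following the layout of Table 3."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """AC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for eight apps (assumed encoding: 1 = malware, 0 = benign).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)                # 3 1 1 3
print(accuracy(tp, fp, fn, tn))      # 0.75
```

Here six of the eight made-up predictions agree with the actual labels, so AC is 6/8.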

Precision (P) is the proportion of predicted positive cases that are correct, determined by

\[ P = \frac{TP}{TP + FP} \]

Recall or True Positive Rate (TPR) is the proportion of positive cases that are correctly identified, defined by

\[ TPR = \frac{TP}{TP + FN} \]

False Positive Rate (FPR) is the proportion of negative cases that are incorrectly classified as positive, defined by

\[ FPR = \frac{FP}{FP + TN} \]

f1-score or F-measure is a weighted harmonic mean of the True Positive Rate (recall) and Precision (P), defined by

\[ F_\beta = \frac{(1 + \beta^2) \cdot P \cdot TPR}{\beta^2 \cdot P + TPR} \]

where \beta has a value from 0 to infinity and controls the weight assigned to TPR and P; \beta = 1 gives the f1-score.

ROC Curve is a graph that summarizes the performance of the classifier over all possible thresholds, generated by plotting the True Positive Rate on the Y-axis against the False Positive Rate on the X-axis.

6.2 Performance Results
Finally, we observe that the ensemble-based learning algorithms (e.g., Random Forest, Extremely Randomized Tree, Bagging, Boosting, Decision Tree) perform better than the other learning algorithms (see Table 4).

Table 4. Result of the performance of algorithms

Algorithm   Precision       Recall          f1-score        Accuracy (%)
            0      1        0      1        0      1
SVM         0.91   0.91     0.91   0.91     0.91   0.91     90.74
RF          0.95   0.94     0.94   0.95     0.94   0.94     94.33
LR          0.78   0.84     0.86   0.75     0.82   0.80     80.94
NB          0.68   0.52     0.15   0.92     0.25   0.66     53.60
MLP         0.85   0.91     0.92   0.83     0.88   0.87     87.54
DT          0.93   0.88     0.88   0.94     0.91   0.91     91.78
ERT         0.93   0.93     0.93   0.93     0.93   0.93     93.66
BAGG        0.94   0.94     0.94   0.94     0.94   0.94     93.71
GB          0.87   0.88     0.89   0.86     0.88   0.87     87.50
k-NN        0.92   0.89     0.89   0.92     0.90   0.91     90.47
KMN         0.54   0.50     0.13   0.88     0.21   0.64     50.40
DA          0.76   0.88     0.91   0.70     0.83   0.78     80.71

We also see that the Naïve Bayes and K-Means classifiers performed very poorly. Naïve Bayes mainly performs well when we have multiple classes and for text classification, because it is simple and, if the conditional independence assumption actually holds, a Naïve Bayes classifier will converge more quickly than discriminative models like logistic regression, so less training data and less training time are needed, even if the Naïve Bayes assumption doesn't hold. The K-Means classifier works better for clustering problems. According to our experiment, the main difference between Naïve Bayes (NB) and Random Forest (RF) is model size: the NB model is small and roughly constant with respect to the data, and NB models cannot represent complex behavior, so they do not overfit. On the other hand, ensemble-based models (e.g., Random Forest, Extremely Randomized Tree) are very large and, if not carefully built, overfit. So, when the data is dynamic and keeps changing, NB can adapt quickly to the changes and new data, whereas an RF would have to be rebuilt every time something changes.

In our experiment, the overall accuracy of each machine learning algorithm using the same parameters is summarized in Figure 2 and Figure 3, which show that the best performance is obtained by Random Forest (an optimization of the bagging classifier among ensemble techniques) using the features Hardware Components, Requested Permissions, App Components, Filtered Intents, Restricted API Calls, Used Permissions, Suspicious API Calls, and Network Address. Random Forest achieves an overall accuracy of 94.33%, a TPR of 94.27%, an FPR of 5.88%, and an AUC of 99.21%. We consider the "best" outcome to be the experiment that maximizes the difference between the TPR and the FPR, since this yields a higher TPR and a lower FPR.

Figure 2. Accuracy

Figure 3. ROC curves
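The metrics of Section 6.1 can be cross-checked against the reported figures with a short sketch. The code below is a minimal pure-Python illustration, not the authors' implementation: it assumes the standard F-beta form of the F-measure, builds ROC points by sweeping each distinct score as a threshold, and integrates them with the trapezoidal rule. The function names and the six example scores are hypothetical; only the Naïve Bayes precision and recall values are taken from Table 4.

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r (assumed F-beta form)."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping every distinct score as a threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Naive Bayes row of Table 4: precision 0.68 / 0.52 and recall 0.15 / 0.92
# for classes 0 and 1 reproduce the listed f1-scores of 0.25 and 0.66.
print(round(f_beta(0.68, 0.15), 2))   # 0.25
print(round(f_beta(0.52, 0.92), 2))   # 0.66

# Hypothetical classifier scores for six apps (1 = positive class).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))  # 8/9, about 0.889
```

The agreement between the recomputed and listed f1-scores for Naïve Bayes suggests that the six per-class columns of Table 4 are ordered as precision, recall, and f1-score for classes 0 and 1.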

7. CONCLUSION AND FUTURE WORKS
Detecting mobile malware has become an important problem due to the rapid worldwide explosion of mobile devices; accordingly, several datasets of Android malware have been created for research and analysis. In this paper, we studied the problem on the DREBIN dataset using several machine learning algorithms and performed our experimentation on these algorithms. Most previous research on Android malware used only two common features, system permissions and API calls, whereas in our experimentation we used 8 features and observed the variations of parameters and their effect on the accuracy of detection.

For future work, we propose to study new feature extraction and selection techniques, the combinational effects of feature sets and learning machines, ensemble methods, and anomaly detection techniques for identifying new malware.

8. REFERENCES
[1] Drake, J. J., Lanier, Z., Mulliner, C., Fora, P. O., Ridley, S. A. and Wicherski, G. 2014. Android Hacker's Handbook, John Wiley & Sons, Indianapolis.
[2] Peiravian, N. and Zhu, X. 2013. Machine learning for android malware detection using permission and API calls. In Tools with Artificial Intelligence (ICTAI), Proc. of 2013 IEEE 25th International Conference, pp. 300-305.
[3] Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H. and Rieck, K. 2014. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket, NDSS, USA.
[4] Yan, P. and Yan, Z. 2017. A Survey on Dynamic Mobile Malware Detection, Springer, New York, USA.
[5] Grace, M., Zhou, Y., Zhang, Q., Zou, S. and Jiang, X. 2012. Riskranker: scalable and accurate zero-day android malware detection, Proc. of International Conference on Mobile Systems, Applications, and Services (MOBISYS), pp. 281-294.
[6] Qiao, M., Sung, A. H. and Liu, Q. 2016. Merging Permission and API Features for Android Malware Detection, Proc. of IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), Kumamoto, Japan, 2016, pp. 566-571. DOI: 10.1109/IIAI-AAI.2016.237.
[7] Rana, M. S. and Sung, A. H. 2018. Malware Analysis on Android using Supervised Machine Learning Techniques, International Journal of Computer and Communication Engineering, vol. 7, no. 4, pp. 178-188. DOI: 10.17706/ijcce.2018.7.4.178-188.
[8] Rana, M. S., Rahman, S. S. M. M. and Sung, A. H. 2018. Evaluation of Tree Based Machine Learning Classifiers for Android Malware Detection. In: Nguyen N., Pimenidis E., Khan Z., Trawiński B. (eds) Computational Collective Intelligence. ICCCI 2018. Lecture Notes in Computer Science, vol. 11056. Springer, Cham. DOI: https://doi.org/10.1007/978-3-319-98446-9_35.
[9] Rana, M. S., Gudla, C. and Sung, A. H. 2018. Android Malware Detection using Stacked Generalization, 27th International Conference on Software Engineering and Data Engineering. (In press)
[10] Talha, K. A., Alper, D. I. and Aydin, C. 2015. APK Auditor: Permission-based Android malware detection system, Digital Investigation, ELSEVIER, vol. 13, June 2015, pp. 1-14.
[11] Sahs, J. and Khan, L. 2012. A machine learning approach to android malware detection, Intelligence and Security Informatics Conference (EISIC), IEEE, pp. 141-147.
[12] scikit-learn: machine learning in Python, scikit-learn 0.16.1 documentation, link: http://scikit-learn.org/stable/, last accessed: 2018/10/29.
[13] Wu, W. C. and Hung, S. H. 2014. DroidDolphin: a dynamic Android malware detection framework using big data and machine learning, Proc. of 2014 Conference on Research in Adaptive and Convergent Systems, ACM, pp. 247-252.
[14] Chang, C. C. and Lin, C. J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, vol. 2, issue 3, Article 27, pp. 1-27. DOI: https://doi.org/10.1145/1961189.1961199.
[15] Decision Tree - Classification, link: https://www.saed-sayad.com/decision_tree.htm, last accessed: 2018/10/29.
[16] Towards Data Science | The Random Forest Algorithm, link: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd, last accessed: 2018/10/29.
[17] Geurts, P., Ernst, D. and Wehenkel, L. 2006. Extremely randomized trees, Machine Learning, 63(1), pp. 3-42.
[18] A Comprehensive Guide to Ensemble Learning, link: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/, last accessed: 2018/10/29.
[19] Towards Data Science | Support Vector Machine - Introduction to Machine Learning Algorithms, link: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47, last accessed: 2018/10/29.
[20] Neural Networks with Scikit, link: https://www.python-course.eu/neural-networks-with-scikit.php, last accessed: 2018/10/29.
[21] Naive Bayes for Machine Learning, link: https://machinelearningmastery.com/naive-bayes-for-machine-learning/, last accessed: 2018/10/29.
[22] K-Nearest Neighbors for Machine Learning, link: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/, last accessed: 2018/10/29.
[23] Discriminant Analysis, link: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/-Discriminant_Analysis.pdf, last accessed: 2018/10/29.
[24] Towards Data Science | Logistic Regression - Detailed Overview, link: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc, last accessed: 2018/10/29.
[25] Bagging and Random Forest Ensemble Algorithms for Machine Learning, link: https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/, last accessed: 2018/10/29.
[26] VanderPlas, J. 2016. Python Data Science Handbook - Essential Tools for Working with Data, O'Reilly.
[27] Confusion Matrix, link: http://www2.cs.uregina.ca/dbd/-cs831/notes/confusion-matrix/confusion-matrix.html, last accessed: 2018/10/29.
