
Android Malware Detection and Classification Using
Machine Learning Techniques

Submitted in fulfilment of the seminar requirement for the degree of


Bachelor of Technology
Computer Science and Engineering
by
Abhishek Kumar
2016UGCS023

Department of Computer Science and Engineering

National Institute of Technology, Jamshedpur

Autumn Semester 2019


Contents
I. Acknowledgment.
II. Abstract.
III. Introduction.
IV. Motivation.
V. Methodology.
VI. Classifier.
VII. Performance Evaluation.
VIII. Results.
IX. Conclusion.

Acknowledgment
I would like to express my sincere thanks and gratitude to my
professors for giving me this golden opportunity to present a
seminar on the topic "Android Malware Detection and
Classification Using Machine Learning Techniques", and for
guiding me with suggestions and constructive criticism, which
also helped me gain a good understanding of machine
learning and computer security.

- Abhishek Kumar (2016UGCS023)

Abstract
Malware poses a serious danger to internet users today.
Malware designed by attackers is now generally polymorphic
in nature. Polymorphic malware constantly changes its
identifiable features in order to evade detection by typical
signature-based models. Opcode-frequency-based malware
detection evaluates a sample based on the frequency of
opcodes in its disassembled file. The opcode frequencies are
obtained through static analysis; different machine learning
models can then be applied to detect whether the sample is
malware, or to classify it into a known malware family. In this
report, I discuss the opcode-based detection method and how
different machine learning techniques can be applied to detect
and classify malware in Android applications.
Introduction
We all know the importance of the internet in our
lives. It has grown rapidly in recent decades.
Alongside this growth, a large number of hackers
and terrorists who intend to commit crimes are
creating malware. Also, with the large number of
tools available nowadays, the skill required to
create new malware is decreasing rapidly.
Malware can be defined as any type of malicious
code that has the potential to harm a computer or
a network. Modern malware is designed with
mutation characteristics, namely polymorphism
and metamorphism, which cause enormous
growth in the number of variants of malware
samples. Malware poses a great challenge in our
day-to-day life. With its ever-increasing volume,
it has become necessary to find an efficient
method to detect and remove it. The
effectiveness of existing anti-malware software
has dropped significantly since the introduction
of polymorphism.
Motivation
The growing volume of malware, and the declining
effectiveness of existing anti-malware software against
polymorphic samples, motivate this work. Polymorphism
encrypts the code of a virus, which changes its signature
as well. To understand why this undermines anti-malware
software, it helps to know how such software works.
Anti-malware software keeps a database of virus
signatures, which is updated regularly. Whenever it
encounters a file, it checks whether the file's signature
is in the database. If it is, the file is treated as a
virus; otherwise it is treated as a clean file. A
polymorphic variant whose signature is not yet in the
database is therefore treated as clean.
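
The signature lookup described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a signature is a SHA-256 hash of the file's bytes (real products use richer signatures), and the payload bytes and the names `signature`, `is_known_malware` and `signature_db` are hypothetical.

```python
import hashlib


def signature(data: bytes) -> str:
    """Return a SHA-256 digest, used here as a stand-in for a malware signature."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, signature_db: set) -> bool:
    """Flag a file only if its signature is already in the database."""
    return signature(data) in signature_db


# Hypothetical payloads: the variant differs from the original by a single byte.
original = b"\x12\x01\x0e\x00malicious-payload"
variant = original + b"\x00"

# The database only knows the original sample.
signature_db = {signature(original)}

print(is_known_malware(original, signature_db))  # True
print(is_known_malware(variant, signature_db))   # False
```

Even a one-byte change produces a different signature, so the variant passes as clean until the database is updated. This is the gap that frequency-based features try to close.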
Methodology
• A disassembler converts the code of the suspicious app to
assembly code.
• The disassembled code contains content that is useless for
analysis, such as comments, labels and empty lines, which is
removed. Then another tool, called the "Opcode Statistics
Extractor (OSE)", calculates the frequency of each opcode.
• The information extracted in the previous step is the input to a
classifier, which classifies the app as malware or benign.
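
The cleaning and counting steps above can be sketched roughly as follows. This is not the OSE tool itself. It assumes a smali-style listing in which comments start with `#`, labels with `:` and directives with `.`, and the sample listing and function names are made up for illustration.

```python
from collections import Counter


def clean_lines(disassembly: str) -> list:
    """Drop empty lines, comments, labels and directives (smali-style markers assumed)."""
    kept = []
    for raw in disassembly.splitlines():
        line = raw.split("#", 1)[0].strip()  # cut everything after a comment marker
        if not line or line.startswith((":", ".")):
            continue  # empty line, label, or directive
        kept.append(line)
    return kept


def opcode_frequencies(disassembly: str) -> Counter:
    """Count how often each opcode (first token of an instruction) appears."""
    return Counter(line.split()[0] for line in clean_lines(disassembly))


def feature_vector(counts: Counter, vocabulary: list) -> list:
    """Relative frequency of each vocabulary opcode; all zeros if nothing was counted."""
    total = sum(counts.values())
    return [counts[op] / total if total else 0.0 for op in vocabulary]


# Hypothetical disassembled method, for illustration only.
SAMPLE = """
.method public foo()V
    # load a constant
    const/4 v0, 0x1
    :loop_start
    invoke-virtual {p0}, Lcom/example/A;->bar()V

    const/4 v1, 0x0
    return-void
.end method
"""

counts = opcode_frequencies(SAMPLE)
print(counts)
print(feature_vector(counts, ["const/4", "invoke-virtual", "return-void", "goto"]))
# [0.5, 0.25, 0.25, 0.0]
```

Using relative rather than absolute frequencies keeps apps of different sizes comparable; whether the original tool normalises in this way is not stated in the report.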
Classifier
• A fixed set of 256 distinct opcodes that can be used in an
Android app was investigated.
• For classification, two different classifiers were used:
Random Forest and Support Vector Machine (SVM).
• Several training/testing configurations were used: 70-30,
80-20 and 90-10 splits, and 10-fold cross-validation. In this
way, both the most accurate training/testing distribution
and the best classifier can be identified.
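
The evaluation protocol above (three holdout splits plus 10-fold cross-validation) can be sketched as follows. To keep the example dependency-free, a simple nearest-centroid classifier is used as a stand-in; it is not Random Forest or SVM. The data is synthetic and all function names are hypothetical.

```python
import random


def holdout_split(samples, train_fraction, seed=0):
    """Shuffle and split samples into (train, test) by the given training fraction."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(samples, k=10, seed=0):
    """Yield (train, test) pairs; each sample lands in exactly one test fold."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


def fit_centroids(train):
    """Stand-in classifier: mean feature vector per label (NOT Random Forest or SVM)."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)


def accuracy(train, test):
    """Fit on train, then return the fraction of test samples predicted correctly."""
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, f) == y for f, y in test)
    return hits / len(test)


# Synthetic, made-up two-feature "opcode frequency" vectors, for illustration only.
rng = random.Random(42)
data = [([rng.uniform(0.6, 1.0), rng.uniform(0.0, 0.4)], "malware") for _ in range(50)]
data += [([rng.uniform(0.0, 0.4), rng.uniform(0.6, 1.0)], "benign") for _ in range(50)]

for train_pct in (70, 80, 90):
    train, test = holdout_split(data, train_pct / 100)
    print(f"{train_pct}-{100 - train_pct} split:", accuracy(train, test))

cv_scores = [accuracy(train, test) for train, test in k_fold_splits(data, k=10)]
print("10-fold mean accuracy:", sum(cv_scores) / len(cv_scores))
```

Swapping `fit_centroids` and `predict` for a real Random Forest or SVM implementation would leave the splitting logic unchanged, which is the point of the sketch.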

Performance Evaluation
• Confusion matrices were created from the responses of the
classifiers.
• To assess the proposed system, the following measures are
used:
  • True Positive (TP): the number of malware samples that
  are correctly detected as malware.
  • True Negative (TN): the number of benign files that
  are correctly detected as benign.
  • False Positive (FP): the number of benign files that
  are incorrectly detected as malware.
  • False Negative (FN): the number of malware samples that
  are incorrectly detected as benign.
• The True Positive Rate (TPR), False Positive Rate (FPR),
accuracy, precision, recall and F1 score are calculated as follows:

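\[ \text{True positive rate} = \frac{TP}{TP + FN} \]

\[ \text{False positive rate} = \frac{FP}{FP + TN} \]

\[ \text{Total accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]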
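
As a small worked example, the measures defined above can be computed directly from predicted and true labels. The labels below are made up for illustration, and the helper names are hypothetical; malware is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="malware"):
    """Return (TP, TN, FP, FN), treating `positive` as the malware class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    """Compute the rates defined in this section from the four confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "tpr": recall,  # the true positive rate is the same quantity as recall
        "fpr": safe_div(fp, fp + tn),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up labels: 4 malware and 4 benign samples, with one mistake in each class.
y_true = ["malware", "malware", "malware", "malware",
          "benign", "benign", "benign", "benign"]
y_pred = ["malware", "malware", "malware", "benign",
          "benign", "benign", "benign", "malware"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
print(metrics(tp, tn, fp, fn))
```

With TP = 3, TN = 3, FP = 1 and FN = 1, precision, recall, accuracy and F1 all equal 0.75, and the false positive rate is 0.25.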
Results

Training and testing results:

Algorithm        Precision   Recall   F-Score
SVM              0.952       0.951    0.951
Random Forest    0.937       0.935    0.935


Conclusion
In conclusion, further research is needed in the area of
malware detection and classification, since the internet is
reaching more and more people every day and creating malware
is becoming easier.
In this report, I presented a method for detecting malware
based on the analysis of opcode statistics. The results show
that detecting malware through statistical analysis could be a
robust technique: SVM reached an F-score of 0.951 and Random
Forest 0.935.
This one technique illustrates that machine learning can be of
great help in defending against malware.
