
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/324121894

Vulnerability Detection in Source Code Based on Git History

Thesis · February 2018


DOI: 10.13140/RG.2.2.28338.09922


1 author:

Kenta Yamamoto
Japan Advanced Institute of Science and Technology

All content following this page was uploaded by Kenta Yamamoto on 31 March 2018.



Vulnerability Detection in Source Code
Based on Git History

Kenta Yamamoto <ymkjp@jaist.ac.jp> (1510756)


School of Information Science,
Japan Advanced Institute of Science and Technology

February 10, 2018

Keywords: Security, Vulnerability Management, Software Development, Version Control, Machine Learning.

Abstract

This paper proposes an approach that aims to reduce the false positive rate compared with VCCFinder, the novel vulnerability detection methodology introduced by Perl et al.[1]. In particular, the approach distinguishes added lines from removed lines when extracting patch features, whereas VCCFinder does not. Consequently, it improves the AUC (area under the curve) of the precision-recall curve by 18.8% over our replication of VCCFinder. Besides, to gain more profound insights from the experiment, this study also reveals that variables consisting of words related to computer resources contributed most significantly to the classification model.

1 Introduction

The number of reported vulnerabilities in released software products is increasing considerably. The number of vulnerabilities registered in CVE rose from 1,020 in 2000 to 14,643 in 2017[2]. As information technology is adopted more broadly, the impact of security incidents is becoming more extensive and critical. For instance, Equifax, one of the largest credit bureaus in the US, exposed the data of 143 million consumers in 2017 due to a vulnerability in a web application[3]. Largest of all in number, Yahoo exposed the account information of approximately 3 billion users in 2013[4].

To avoid these security issues and distribute software of proper quality, software development projects widely adopt a variety of automation approaches. However, as mentioned above, a great many serious vulnerabilities are still being discovered. Meanwhile, vulnerability detection techniques face two contradictory objectives: (a) if a vulnerability detection tool seeks high precision, it requires additional effort from developers, such as annotating unsafe user input as tainted[5]; on the other hand, (b) if a tool relieves developers of such burdens, it cannot attain enough precision to assist the development process. To solve this severe problem, Perl et al. selected Git history as the source of training datasets, which does not sacrifice adaptability, and successfully achieved high precision.

Git is suitable as a data source because it is the de facto standard for version control. According to the Stack Overflow survey, Git is used by 69.2% of 30,730 developers[6]. More importantly in this context, notable OSS (open-source software) projects such as the Linux kernel, OpenSSL, FFmpeg, PostgreSQL, Chrome V8, and Apache HTTPD also manage their source code with Git. Consequently, there is a sufficient number of reliable security fixes that are publicly tracked, either because their commit messages refer to CVE-IDs or because the CVE database refers to the fixing commits.

In this paper, we show that our approach achieves an AUC of 0.901 for its ROC (receiver operating characteristic) curve and 0.082 for its precision-recall curve, while the replication of VCCFinder achieves 0.894 and 0.069, respectively.

2 Related Work

Even when limited to static analysis by machine learning, there is a large body of existing research. Although spam detection, one of the most traditional tasks in machine learning, shares certain aspects of text processing with vulnerability detection, source code has different characteristics from natural language, such as the high frequency of words reserved by the programming language. As mentioned, Perl et al. retrieve training data from Git history associated with CVE-IDs. Compared with Flawfinder[7], they reduced the number of false alarms by over 99% at the same level of recall[1].

Wang et al. parse source code into an AST (abstract syntax tree) rather than processing it purely as text[8]. In the middle phase of the procedure, their methodology extracts the nodes belonging to particular categories in the tree: it selects method definitions, method calls, instantiations, and control statements, but excludes object types and assignment statements. Interestingly, it aims to predict defects by learning from a ready-made dataset[9]. Since their approach depends on an AST, its adaptability is not as high as that of VCCFinder. By contrast, requiring only that a program be amenable to lexical analysis is not a demanding prerequisite. Besides, since defects are generally more ambiguous than vulnerabilities, a prediction model for them tends to be harder to construct. Still, there is no doubt that defect prediction helps to make software reliable.

Li et al. propose an HVD method called VulPecker. It first maps a representation format to each vulnerability[10]. In other words, VulPecker lists candidate code-similarity algorithms and lets them compete for each CVE-ID. This approach works well because no single algorithm handles every feature well, while each algorithm is good at handling particular features. For instance, ReDebug is dedicated to detecting unmodified code clones but cannot identify clones once variables are modified. Then, LibSVM[11] learns from the vulnerability signatures generated by the selected algorithm and detects whether a given patch contains dangerous code. Like the other methodologies, this one does not need the involvement of a security specialist until the potentially vulnerable source code is reviewed. Since some of the code-similarity algorithms depend on an AST, VulPecker's adaptability is not exceptionally high either.

3 Methodology

This methodology first converts a set of commit patches into a bag-of-words model, as Perl et al. do. We can then regard the vulnerability detection task as an analysis of $S$, which denotes a set of tokens. $S$ is the result of a lexical analysis in which each patch is split at spaces and line breaks. This study inherits from VCCFinder[1] the formal definition of the function $\varphi$ that maps a commit to a vector space:

$$\varphi : X \longrightarrow \mathbb{R}^{|S|}, \qquad \varphi : x \longmapsto \bigl(b(x, s)\bigr)_{s \in S} \tag{1}$$

where $X$ is the set of all commits, and an individual commit $x \in X$ is embedded in the vector space. The auxiliary function $b(x, s)$ returns 1 if the token $s$ exists in $x$, and 0 otherwise.

Our approach makes it possible to query a patch by line type. Therefore, the difference from the prior methodology by Perl et al. is that

each token takes into account whether it is contained in an added line or a removed line. For convenience, we name these variants "LT-S" (line type sensitive) and "LT-I" (line type insensitive). We extend the token set $S$ to $E_S$, in which each token $s$ is suffixed with _ADDED, _REMOVED, or _MIXED, corresponding respectively to:

$$\{ s : s \in E_S,\ l(s) \in L_a \} \tag{2}$$

$$\{ s : s \in E_S,\ l(s) \in L_r \} \tag{3}$$

$$\{ s : s \in E_S,\ l(s) \in (L_a \cup L_r) \} \tag{4}$$

where $F \subseteq P$, $H \subseteq F$, $L \subseteq H$, $L_a \subseteq L$, $L_r \subseteq L$, and all of these sets are expressed in the unified-diff format of Git: $P$ denotes a set of patches, $F$ a set of files, $H$ a set of hunks, $L$ a set of lines, $L_a$ a set of added lines, $L_r$ a set of removed lines, and the function $l(s)$ returns the line from which the token $s$ is extracted. An arbitrary commit $x$ has an empty patch $P = \emptyset$ when the --allow-empty option is given to Git. With this distinction, and with $S$ replaced by $E_S$ in formula (1), the sparse vector of a fictitious commit $x$ looks like:

$$
\varphi(x) \mapsto
\begin{pmatrix}
\vdots \\ 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ \vdots
\end{pmatrix}
\begin{array}{l}
\\
\text{AUTHOR\_CONTRIBUTION: 0.0} \\
\text{AUTHOR\_CONTRIBUTION: 10.0} \\
\\
\text{buf\_write\_func\_ADDED} \\
\text{buf\_write\_func\_REMOVED} \\
\text{buf\_write\_func\_MIXED} \\
\text{func\_foo\_ADDED} \\
\text{func\_foo\_REMOVED} \\
\text{func\_foo\_MIXED} \\
\\
\end{array}
$$

This methodology follows the rest of VCCFinder's approach, including the other feature engineering and the learning and prediction phases. Figure 1 briefly illustrates the whole procedure.

4 Evaluation

This study evaluates the LT-S methodology with a Python implementation, whose source code is planned to be made available at https://github.com/announce/hvd. To make the experiment reliable, we adopt a variety of libraries, including NumPy for multidimensional matrix computation[12], SciPy for handling CSR (compressed sparse row) matrices[12], unidiff for parsing the unified-diff format[13], and scikit-learn for vectorization, the Support Vector Machine (SVM), cross-validation, and metrics calculation[14]. We also implement a replication of VCCFinder as LT-I to compare it with the LT-S methodology.

Perl et al. provide a dataset of commits labelled as VCC (changes containing a vulnerability) or UC (unclassified changes) and associated with their CVE-IDs[1]. It comprises 714 VCCs out of 350k commits in total, taken from 66 of the most prominent OSS repositories implemented in C/C++, such as those mentioned in the introduction. The number of unique tokens is 170k.

In evaluating a classifier model, it is not acceptable for such a tool to miss vulnerable code. Therefore, high recall ($R = \frac{T_p}{T_p + F_n}$) is required; for $(T_p, T_n, F_p, F_n)$, refer to Table 1. To assist development processes such as code review, however, the precision ($P = \frac{T_p}{T_p + F_p}$) needs to be satisfactory as well. Otherwise, reviewers may spend as much as half of their working hours on a report, analysing whether the detected commits are truly vulnerable. Accordingly, this type of study considers the balance of precision and recall. For a comprehensive analysis, we calculate the area under the ROC curve and under the precision-recall curve, where the ROC curve plots $(x, y) = (FPR, R)$, the PR curve plots $(x, y) = (P, R)$, and $FPR = \frac{F_p}{F_p + T_n}$. To make the result replicable, we provide a fixed seed to the pseudo-random generator and choose the same SVM hyper-parameters as VCCFinder.

As a result, LT-S achieves 0.901 for the area under its ROC curve and 0.082 for the area under its precision-recall curve, while LT-I achieves 0.894 and 0.069, respectively. This is an 18.8% improvement in the area under the precision-recall curve. In the experiment, LT-S took 45m36s on a computer with 10 CPU cores and 62 GB of RAM, whereas LT-I took only 17m06s. Nonetheless, the vast majority of the processing time is spent in the learning phase. In a practical use case, the learnt model, once calculated, is dumped and shared for future predictions for a while. It then takes a few seconds to parse a given unknown commit and perform a prediction using the shared model. Hence, the execution time of the learning phase should not affect the development process.

Here is a brief description of what kinds of features contribute to the classification model. Figure 2 demonstrates that the most contributing features are tokens relevant to computer resources such as CPU, memory, and network. For instance, structors appears to be related to memory allocation for a complex structure. Similarly, vmalloc relates to virtual memory allocation in pages. As for CPU, skbuffhead appears to be related to a spin lock of threads. There are also tokens related to the network: tso stands for TCP Segmentation Offload, and if_ether is presumably a flag for Ethernet availability. In addition, the figure shows that most of the contributing variables are added tokens. These findings do not surprise us, because vulnerabilities evidently correlate closely with the side effects of computer resource management and with added code. However, it is worth verifying that the automatic detection approach does not differ from the experiential intuition of humans.

5 Conclusion

Despite the fact that the features obtainable via Git are limited, this study improved the AUC of the precision-recall curve by 18.8% compared with the replication of VCCFinder, without losing its remarkable adaptability and other advantages, namely: (a) Scalability: it can process 350k commits in a reasonable time. (b) Generality: it takes account of any type of vulnerability reported to CVE, while other static analysis tools for source code, such as Clang Static Analyzer[15] and Trinity[16], focus on bugs caused by semantics and parameter validation, respectively. (c) Explainability: it can explain the basis of its detections, as described in the evaluation section. Nonetheless, since the precision is rather poor, a further reduction of the false positive rate is necessary.

Owing to the dataset provided by Perl et al., we could empirically compare our methodology with the replication of VCCFinder. Note that although we implemented their approach as faithfully as possible, reproducibility is limited because the implementation of VCCFinder is not publicly available, probably due to security concerns.

References

[1] Henning Perl, Matthew Smith, Daniel Arp, Fabian Yamaguchi, Konrad Rieck, and Sascha Fahl. VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assist Code. CCS '15, pages 426–437, 2015.
[2] NIST. National Vulnerability Database. https://nvd.nist.gov/home.cfm, 2018.
[3] SEC. Exhibit99120170907. https://www.sec.gov/, 2017.
[4] Suzanne Philion and David Samberg. Yahoo provides notice to additional users affected by previously disclosed 2013 data theft. https://www.oath.com/press/yahoo-provides-notice-to-additional-users-affected-by-previously/, 2017.
[5] David Evans and David Larochelle. Improving security using extensible lightweight static analysis. IEEE Software, 19(1):42–51, 2002.
[6] StackExchange Inc. Stack Overflow Developer Survey 2017. https://insights.stackoverflow.com/survey/2017, 2017.
[7] David A. Wheeler. Flawfinder. https://www.dwheeler.com/flawfinder/.
[8] Song Wang, Taiyue Liu, and Lin Tan. Automatically learning semantic features for defect prediction. Proceedings - International Conference on Software Engineering, 14-22-May-:297–308, 2016.
[9] T Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The Promise Repository of Empirical Software Engineering Data. http://openscience.us/repo, 2012.
[10] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Hanchao Qi, and Jie Hu. VulPecker: An Automated Vulnerability Detection System Based on Code Similarity Analysis. Proceedings of the 32nd Annual Conference on Computer Security Applications, 1828:201–213, 2016.
[11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 4 2011.

[12] Stéfan Van Der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering, 13(2):22–30, 3 2011.
[13] Matias Bordese. unidiff: Unified diff parsing/metadata extraction library. https://pypi.python.org/pypi/unidiff.
[14] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 1 2012.
[15] Chris Lattner. Clang Static Analyzer, 2016.
[16] Dave Jones. Trinity: Linux system call fuzzer.

Table 1: Prediction vs. Answer

                        Answer: True              Answer: False
Prediction: Positive    Tp (True Positive)        Fp (False Positive¹)
Prediction: Negative    Fn (False Negative²)      Tn (True Negative)

¹ Type-1 errors
² Type-2 errors
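
The quantities in Table 1 determine the metrics used in the evaluation. The following is a minimal sketch of those formulas, not the study's implementation; the function names and the example counts are hypothetical and chosen only for illustration. The last line recomputes the relative improvement from the two precision-recall AUC values reported in the paper (0.069 for LT-I and 0.082 for LT-S).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (precision, recall, false positive rate) from Table 1 counts."""
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Hypothetical counts, not results from the study.
precision, recall, fpr = confusion_metrics(tp=8, fp=2, tn=85, fn=5)

# AUC values of the precision-recall curves as reported in the paper.
gain_percent = relative_improvement(0.069, 0.082) * 100  # about 18.8
```

With so few VCCs among 350k commits, precision is the metric most sensitive to false positives, which is presumably why the paper reports the precision-recall AUC alongside the ROC AUC.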

Figure 1: The overview of the proposed methodology
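
The line-type-sensitive (LT-S) embedding at the core of this procedure can be sketched as follows. This is an illustrative reading of formulas (1) to (4), not the author's code. It assumes a Git-style unified diff, splits lines on whitespace, and uses only the standard library, whereas the paper's implementation relies on unidiff and scikit-learn. The names `lts_tokens`, `embed`, `PATCH`, and `VOCABULARY` are hypothetical.

```python
def lts_tokens(patch_text):
    """Return the set of line-type-suffixed tokens found in a unified diff."""
    tokens = set()
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section starts; its headers follow
            continue
        if line.startswith("@@"):
            in_hunk = True  # hunk header; content lines follow
            continue
        if not in_hunk:
            continue  # skips the '---' and '+++' file header lines
        if line.startswith("+"):
            suffix = "_ADDED"
        elif line.startswith("-"):
            suffix = "_REMOVED"
        else:
            continue  # context lines carry no line type in this sketch
        for token in line[1:].split():
            tokens.add(token + suffix)
            # Formula (4) uses the union of added and removed lines.
            tokens.add(token + "_MIXED")
    return tokens


def embed(patch_text, vocabulary):
    """Binary vector over `vocabulary`: b(x, s) = 1 if token s occurs in x."""
    present = lts_tokens(patch_text)
    return [1 if s in present else 0 for s in vocabulary]


# A made-up patch and vocabulary, for illustration only.
PATCH = """diff --git a/buf.c b/buf.c
--- a/buf.c
+++ b/buf.c
@@ -1,3 +1,3 @@
 int n;
-  func_bar = handler
+  buf_write_func = handler
"""
VOCABULARY = [
    "buf_write_func_ADDED", "buf_write_func_REMOVED", "buf_write_func_MIXED",
    "func_foo_ADDED", "func_foo_REMOVED", "func_foo_MIXED",
]
vector = embed(PATCH, VOCABULARY)  # [1, 0, 1, 0, 0, 0]
```

Dropping the suffixes in `lts_tokens` would give the line-type-insensitive (LT-I) variant, in which a token counts the same whether it was added or removed.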

Figure 2: Top 20 × 2 contributing features (blue: positive, red: negative)
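
A ranking like the one in Figure 2 can be derived from per-feature weights. The sketch below assumes a linear model whose learnt weight for each token is available (for example, the `coef_` attribute of scikit-learn's `LinearSVC`); the paper does not state which kernel was used, so this is an assumption. The function name, the suffixed token names, and the weights are hypothetical.

```python
def top_contributing(names, weights, k=20):
    """Return the k most positive and k most negative (name, weight) pairs."""
    ranked = sorted(zip(weights, names), reverse=True)
    # Largest weights push a commit towards the vulnerable class.
    positive = [(name, w) for w, name in ranked[:k] if w > 0]
    # Most negative weights push it towards the unclassified class.
    negative = [(name, w) for w, name in ranked[::-1][:k] if w < 0]
    return positive, negative


# Made-up weights for illustration; not values from the experiment.
names = ["vmalloc_ADDED", "tso_ADDED", "if_ether_REMOVED", "printf_MIXED"]
weights = [0.9, 0.4, -0.7, -0.1]
positive, negative = top_contributing(names, weights, k=2)
```

Reading contributions directly from the weights is what makes the basis of a detection explainable for a linear model; it would not carry over unchanged to a non-linear kernel.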
