
An Extended Line-Based Approach to Detect Code Clones

Using Syntactic and Lexical Information


Kazuaki Maeda
Department of Business Administration
and Information Science, Chubu University,
1200 Matsumoto, Kasugai, Aichi 487-8501, Japan
kaz@acm.org
Abstract
This paper proposes a new line-based approach for the detection of code clones using syntactic and lexical information. A customized compiler writes a source code representation that contains syntactic and lexical information. A new clone detection tool called LePalex reads the source code representation and converts it to three types of code: first normal form, second normal form, and third normal form. The first normal form is used to detect exact matches of code clones. The second normal form is used to detect syntactic matches of code clones. The third normal form is used to check for syntactically correct segments of code clones. This paper demonstrates the advantage of this approach in achieving programming language independence using syntactic and lexical information.
1. Introduction
In the context of program development, the copy-and-paste technique is useful during the editing of source code. A copied and pasted segment of source code is called a code clone.
Code clones provide a simple means to reuse source code. If a segment of source code has already been tested, it contains fewer bugs than source code written from scratch. The use of particular application program interfaces (APIs) often requires an ordered series of method calls to achieve the desired behavior. For example, when developing GUI applications using the Java Swing API, similar orderings of calls are common across applications. Device drivers of operating systems usually contain a large number of code clones, because all the drivers have the same APIs and most of them implement similar logic. When developing a driver for a new hardware model, cloning an existing driver avoids the risk of changing that driver[1].
Code clones can be troublesome during program maintenance. If an error found in the original source code must be fixed, the code clones must also be fixed. However, the presence of code clones is usually not documented, so they must be detected manually and then fixed.
Code clone detection, which automatically finds copied and pasted segments, is a key technology for resolving such maintenance problems. Many techniques for detecting code clones have been proposed. The more sophisticated approaches among them require the source code to be parsed. Paper[2] states the following:
Parsing the program suite of interest requires a
parser for the language dialect of interest. While
this is nominally an easy task, in practice one
must acquire a tested grammar for the dialect of
the language at hand.
This approach provides powerful code clone detection, but
it requires a unique parser to be built for each target pro-
gramming language.
Other sophisticated approaches that use parsers are not appropriate for the detection of code clones if language independence is required. Paper[3] states the following:

Language dependency is a big obstacle when it comes to the practical applicability of duplication detection. We have thus chosen to employ a technique that is as simple as possible and prove that it is effective in finding duplication.

Language independence is an important aspect of using code clone detection in real applications, because numerous programming languages are in use. The approach[3] is based on simple string matching to achieve language independence, but it provides less powerful detection than the approaches that use parsers.
This paper describes another line-based approach that
detects code clones using syntactic and lexical information,
and also aims to achieve programming language indepen-
dence. A simple string matching algorithm is applied, but
the input is customized and differs from that of other code clone detection tools. Source code is first converted into PALEX code (PArsing actions and LExical information in XML), which is generated by a modified compiler, as proposed in a previous paper[4]. In addition, formatted code containing syntactic and lexical information is generated from the PALEX code. In this paper, the tool that detects code clones is called LePalex.
Section 2 describes related work on code clone detection. Section 3 describes the code clone detection and its input. Section 4 summarizes this paper.
2. Related work
Since the 1990s, many studies have discussed code clone detection. These studies show that a significant percentage of source code consists of code clones. A study[5] suggests that 19% of the total X Window System source code is cloned. Another paper[2] shows that the average occurrence of code clones in all subsystems is 12.7%. In an extreme case, the average occurrence of code clones is 59%[3].
Papers on code clone detection typically categorize the techniques as follows:
Line-based approach (e.g. [3, 5])
Token-based approach (e.g. [6, 7])
AST-based approach (e.g. [2])
Dependency-based approach (e.g. [8])
The different techniques detect three types of code
clones[9], as follows:
Type 1 is an exact copy. Some variations exist in the handling of white space characters and comments.
Type 2 is a syntactically identical copy, in which only user-defined identifiers, types, or the layout of the source code may be changed.
Type 3 is a copy with further modifications. Statements may be changed, added, or removed.
For example, the two segments (lines 4-7 and 11-14) in Figure 1 are textually identical, but their execution results are different. This indicates that the two segments are code clones, and they are categorized as Type 1.
In the line-based approach, all lines are compared one by one, which provides the advantage of programming language independence. This is an important issue in real applications. However, a drawback of this approach is that it ignores lexical and syntactic information in the source code. If a developer changes the preferences for the locations of braces and runs a code formatter on part of the code, the line-based approach fails to detect the code clones.
In the token-based approach, the entire source code is scanned, a sequence of tokens is built, and the tokens are compared
 1: public class Calc {
 2:   int x=1;
 3:   void foo(){
 4:     for(int i = 0; i < 10; i++){
 5:       x = x + 2 * i + 1;
 6:     }
 7:     System.out.println(x);
 8:   }
 9:   void bar(){
10:     int x=1;
11:     for(int i = 0; i < 10; i++){
12:       x = x + 2 * i + 1;
13:     }
14:     System.out.println(x);
15:   }
16: }
Figure 1. Example of code clones
Found a 5 line (32 tokens) duplication in the
following files:
Starting at line 4 of Calc.java
Starting at line 11 of Calc.java
for(int i=0; i < 10; i++){
x = x + 2 * i + 1;
}
System.out.println(x);
}
Figure 2. Output of the code clone detection
using CPD
one by one. For example, if the token-based code clone detection tool CPD[7] reads the source code shown in Figure 1, it detects that the two segments are code clones and generates the output shown in Figure 2. Interestingly, the last token in the output is a brace signifying the end of the method. This brace is a syntactically meaningless token for code clone detection. This is a limitation of the approach, because it ignores syntactic information in the source code.
CCFinder[6] preprocesses source code using syntactic information to fix this problem. In addition, it recognizes certain syntax patterns to detect code clones at a higher level. However, it is dependent on the target programming language. Applying the technique to other programming languages requires costly porting of the code.
In the AST-based approach, syntax-sensitive analysis detects code clones precisely. Generally, a compiler constructs
an AST in the syntax analysis phase to represent syntactic
information in the source code. By modifying the compiler
or building a syntax analyzer from scratch, the AST can be
derived from the source code.
The author and his colleagues are now developing a
static analyzer containing Control Flow Graph (CFG) and
Data Flow Graph (DFG) analysis for Java programs. The total size of its source code is more than one hundred thousand lines. We could obtain more precise results for code clones if the CFG and DFG analysis were applied to detect them. However, applying this kind of code clone detection to other programming languages would require excessive development cost.
All approaches have their advantages and disadvantages. This paper proposes a line-based approach, but the input differs from the original source code. Instead, the input is customized code generated from the PALEX source code representation. The tool that detects code clones using this approach is called LePalex. It is independent of the programming language and is able to detect code clones using syntactic and lexical information. In addition, it does not require excessive development cost.
3. Code clone detection
3.1. A line-based approach using the Karp-Rabin string matching algorithm
The detection tool, LePalex, uses the Karp-Rabin string matching algorithm[10]. This algorithm, which uses hashing to find substrings in text, is conceptually simple and easy to implement. LePalex takes a sequence of lines and calculates a hash value for each line. Once the lines are inserted into a hash table, only the lines in the same entry need to be compared for code clone detection.
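As a rough sketch of this idea (the class below is illustrative and is not part of LePalex; Java's built-in hashCode stands in for the Karp-Rabin rolling hash), each normalized line is inserted into a hash table keyed by its hash value, so that only lines falling into the same bucket need a full comparison:

import java.util.*;

// Illustrative sketch: group lines by hash so that candidate clone lines
// only need to be compared against lines in the same bucket.
public class LineHashIndex {
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    // Insert one normalized line and remember its line number.
    public void add(String line) {
        lines.add(line);
        int h = line.hashCode();  // stand-in for the Karp-Rabin hash
        buckets.computeIfAbsent(h, k -> new ArrayList<>()).add(lines.size() - 1);
    }

    // Report pairs of line numbers whose text is identical.
    public List<int[]> matchingPairs() {
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> bucket : buckets.values()) {
            for (int i = 0; i < bucket.size(); i++) {
                for (int j = i + 1; j < bucket.size(); j++) {
                    int a = bucket.get(i), b = bucket.get(j);
                    if (lines.get(a).equals(lines.get(b))) {  // guard against hash collisions
                        pairs.add(new int[] { a, b });
                    }
                }
            }
        }
        return pairs;
    }
}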
After the PALEX source code representation has been created, LePalex generates three types of code from the PALEX code. In this paper, they are referred to as the first normal form, the second normal form, and the third normal form.
3.2. First normal form
The first normal form is source code in which indentation and redundant white space are stripped from the original source code. Figure 3 shows an example of the first normal form created from the source code in Figure 1. It is easy to convert PALEX code to the first normal form, because the PALEX code includes all the lexical information. If we want to detect Type 1 code clones, including variations with a different layout of the source code, the first normal form is useful.
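A minimal sketch of this normalization step follows. It is an assumption made for illustration: a crude regex tokenizer stands in for the lexical information that the PALEX code actually provides, and string literals and comments are not handled.

import java.util.regex.*;

// Sketch of first-normal-form generation: re-emit the token stream with a
// space only where two word-like tokens would otherwise merge.
public class FirstNormalForm {
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");

    public static String normalize(String line) {
        Matcher m = TOKEN.matcher(line);
        StringBuilder out = new StringBuilder();
        String prev = null;
        while (m.find()) {
            String tok = m.group();
            // Insert a space only between two word-like tokens, e.g. "int x".
            if (prev != null && isWord(prev) && isWord(tok)) out.append(' ');
            out.append(tok);
            prev = tok;
        }
        return out.toString();
    }

    private static boolean isWord(String t) {
        return Character.isLetterOrDigit(t.charAt(0)) || t.charAt(0) == '_';
    }

    public static void main(String[] args) {
        // "  x = x + 2 * i + 1;" becomes "x=x+2*i+1;", as in line 5 of Figure 3.
        System.out.println(normalize("  x = x + 2 * i + 1;"));
    }
}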
3.3. Second normal form
The second normal form is generated code in which all identifiers are moved to the start of the line and the symbol
1:public class Calc{
2:int x=1;
3:void foo(){
4:for(int i=0;i<10;i++){
5:x=x+2*i+1;
6:}
7:System.out.println(x);
8:}
9:void bar(){
10:int x=1;
11:for(int i=0;i<10;i++){
12:x=x+2*i+1;
13:}
14:System.out.println(x);
15:}
16:}
Figure 3. Example of first normal form
1:$public,Calc:$ class ${
2:$int,x:$ $=1;
3:$void,foo:$ $(){
4:$int,i,i,i:for($ $=0;$<10;$++){
5:$x,x,i:$=$+2*$+1;
6: }
7:$System,out,println,x:$.$.$($);
8: }
9:$void,bar:$ $(){
10:$int,x:$ $=1;
11:$int,i,i,i:for($ $=0;$<10;$++){
12:$x,x,i:$=$+2*$+1;
13: }
14:$System,out,println,x:$.$.$($);
15: }
16: }
Figure 4. Example of second normal form
"$" is embedded at the original location. The moved identifiers are placed between the symbol "$" and ":" at the start of each line. At the first character position of each line, "$" signifies that the line contains identifiers, and a space character signifies that the line does not contain any identifiers. The second normal form can easily be converted back to the first normal form if each "$" is replaced with the original identifier and the space character in the first position is deleted.
Figure 4 shows an example of the second normal form created from the source code in Figure 1. LePalex skips the identifiers between "$" and ":" when it calculates hash values for the Karp-Rabin string matching algorithm. After all the lines in the source code are inserted into the hash table, each entry in the table is examined to detect code clones. In addition, LePalex finds continuous segments among the code clones and combines them into larger segments.
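The merging of adjacent matches can be pictured as follows. This is an assumed sketch rather than the actual LePalex implementation; each matched pair of line numbers is packed into a 64-bit key, and single-line matches are extended into maximal multi-line clones.

import java.util.*;

// Illustrative sketch: extend matched single lines (a, b) into maximal runs
// where line a+k also matches line b+k, reported as {startA, startB, length}.
public class SegmentMerger {
    public static List<int[]> merge(Set<Long> matchedPairs) {
        List<int[]> clones = new ArrayList<>();
        for (long p : matchedPairs) {
            int a = (int) (p >> 32), b = (int) p;
            // Only start a clone where the previous pair is not also matched,
            // so every reported segment is maximal.
            if (matchedPairs.contains(key(a - 1, b - 1))) continue;
            int len = 0;
            while (matchedPairs.contains(key(a + len, b + len))) len++;
            clones.add(new int[] { a, b, len });
        }
        return clones;
    }

    static long key(int a, int b) {
        return ((long) a << 32) | (b & 0xffffffffL);
    }
}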
1:$public,Calc:$ <sft fr="16" to="55"/><rdc st="55" ln="1"/><rdc st="64" ln="1"/>class <rdc st="63" ln="1"/><sft fr="106" to="228"/>$<sft fr="365" to="347"/>{
2:$int,x:<rdc st="283" ln="2"/><sft fr="979" to="1015"/>$ <sft fr="1095" to="94"/><rdc st="94" ln="1"/><rdc st="111" ln="1"/>$<rdc st="238" ln="2"/><sft fr="1133" to="220"/>=<sft fr="220" to="356"/>1<sft fr="356" to="151"/><rdc st="151" ln="1"/><rdc st="188" ln="1"/><rdc st="170" ln="1"/>;
Figure 5. Example of third normal form (lines 1 and 2)
Even if only user-defined identifiers are changed, LePalex detects the code clones. For example, if the identifier x is replaced with y, line 12 is converted to

$y,y,i:$=$+2*$+1;

In this case, the statements (lines 5 and 12) after the identifier part between "$" and ":" are the same. LePalex calculates the same hash values for the two lines, checks the identifier pattern, and determines that they are cloned lines.
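The comparison key for a second-normal-form line can be sketched as follows (a hypothetical helper, not the actual implementation): when a line starts with "$", everything up to the first ":" is the identifier list and is dropped before hashing, so renaming x to y leaves the key unchanged.

// Illustrative sketch: the part of a second-normal-form line used for hashing.
// Lines beginning with '$' carry their identifier list before the first ':';
// that list is skipped, so renamed identifiers hash identically.
public class SecondNormalFormKey {
    public static String hashedPart(String line) {
        if (line.startsWith("$")) {
            int colon = line.indexOf(':');
            if (colon >= 0) return line.substring(colon + 1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Line 12 of Figure 4 and its renamed variant both yield "$=$+2*$+1;".
        System.out.println(hashedPart("$x,x,i:$=$+2*$+1;"));
        System.out.println(hashedPart("$y,y,i:$=$+2*$+1;"));
    }
}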
3.4. Third normal form
The third normal form is a combination of the second normal form and syntactic information in the form of an XML document. Figure 5 shows an example of the third normal form composed from lines 1 and 2 in Figure 4. The third normal form preserves the entire text of the second normal form, with syntactic information embedded in it as XML tags. Although Figure 5 resembles a sequence of XML tags, the second normal form can be obtained from the third normal form by stripping away all XML tags.

When LePalex calculates the hash value for matching, it skips the identifier part between "$" and ":" and ignores all XML tags. After the entire source code is inserted into the hash table, each entry in the table is examined to detect code clones. In addition, LePalex finds continuous segments among the code clones and combines them into larger segments. Finally, it decomposes the code clones into complete syntactic segments using the XML tags.
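As with the second normal form, the text that is hashed for a third-normal-form line can be sketched by removing the embedded XML tags and the identifier prefix. The helper below is an assumption made for illustration; its regular expression only covers self-closing tags such as those shown in Figure 5.

// Illustrative sketch: recover the hashed text of a third-normal-form line by
// stripping the embedded XML tags (which yields the second normal form) and
// then dropping the identifier list before the first ':'.
public class ThirdNormalFormKey {
    public static String stripTags(String line) {
        // Figure 5 only contains self-closing tags such as <sft .../> and <rdc .../>.
        return line.replaceAll("<[^>]*/>", "");
    }

    public static String hashedPart(String line) {
        String secondNf = stripTags(line);
        if (secondNf.startsWith("$")) {
            int colon = secondNf.indexOf(':');
            if (colon >= 0) return secondNf.substring(colon + 1);
        }
        return secondNf;
    }

    public static void main(String[] args) {
        // The content of line 1 in Figure 5 (without the leading line number)
        // reduces to "$ class ${", the same text that line 1 of Figure 4 hashes to.
        String line1 = "$public,Calc:$ <sft fr=\"16\" to=\"55\"/><rdc st=\"55\" ln=\"1\"/>"
                     + "<rdc st=\"64\" ln=\"1\"/>class <rdc st=\"63\" ln=\"1\"/>"
                     + "<sft fr=\"106\" to=\"228\"/>$<sft fr=\"365\" to=\"347\"/>{";
        System.out.println(hashedPart(line1));
    }
}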
4. Conclusion
This paper describes a line-based approach for the detec-
tion of code clones using syntactic and lexical information.
A customized compiler writes both syntactic and lexical in-
formation. LePalex reads this information and converts it to three types of code: first normal form, second normal form, and third normal form. The first normal form is used to detect Type 1 code clones. The second normal form is used to detect Type 2 code clones. The third normal form is used to check for syntactically correct segments of code clones.
To verify the validity of this approach, the source code of
commercial products is currently being analyzed using this
approach. The results of this analysis will be published in a
future paper.
References
[1] Cory Kapser and Michael W. Godfrey, "Cloning Considered Harmful" Considered Harmful, Working Conference on Reverse Engineering, pp.19-28, 2006.
[2] Ira D. Baxter, Andrew Yahin, et al., Clone Detection Using Abstract Syntax Trees, International Conference on Software Maintenance, pp.368-377, 1998.
[3] Stéphane Ducasse, Matthias Rieger, and Serge Demeyer, A Language Independent Approach for Detecting Duplicated Code, 15th IEEE International Conference on Software Maintenance, pp.109-118, 1999.
[4] Kazuaki Maeda, XML-Based Source Code Representation with Parsing Actions, The International Conference on Software Engineering Research and Practice, pp.715-720, 2007.
[5] Brenda S. Baker, On finding duplication and near-duplication in large software systems, Working Conference on Reverse Engineering, pp.86-95, 1995.
[6] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code, IEEE Transactions on Software Engineering, vol.28, no.7, pp.654-670, 2002.
[7] PMD: Finding copied and pasted code, available from http://pmd.sourceforge.net/cpd.html.
[8] Jens Krinke, Identifying Similar Code with Program Dependence Graphs, Working Conference on Reverse Engineering, pp.301-309, 2001.
[9] Stefan Bellon, Rainer Koschke, et al., Comparison and Evaluation of Clone Detection Tools, IEEE Transactions on Software Engineering, vol.33, no.9, pp.577-591, 2007.
[10] Richard M. Karp and Michael O. Rabin, Efficient randomized pattern-matching algorithms, IBM Journal of Research and Development, vol.31, no.2, 1987.
