An Extended Line-Based Approach to Detect Code Clones
Using Syntactic and Lexical Information
An Extended Line-Based Approach to Detect Code Clones Using Syntactic and Lexical Information

Kazuaki Maeda
Department of Business Administration and Information Science, Chubu University,
1200 Matsumoto, Kasugai, Aichi 487-8501, Japan
kaz@acm.org

Abstract

This paper proposes a new line-based approach for the detection of code clones using syntactic and lexical information. A customized compiler writes a source code representation that contains syntactic and lexical information. A new clone detection tool called LePalex reads the source code representation and converts it to three types of code: first normal form, second normal form, and third normal form. The first normal form is used to detect exact matches of code clones. The second normal form is used to detect syntactic matches of code clones. The third normal form is used to check for syntactically correct segments of code clones. This paper demonstrates the advantage of this approach in achieving programming language independence using syntactic and lexical information.

1. Introduction

In the context of program development, the copy-and-paste technique is useful during the editing of source code. A copied and pasted segment of source code is called a code clone. Code clones provide a simple means of reusing source code. If a segment of source code has already been tested, it contains fewer bugs than source code written from scratch. The use of particular application program interfaces (APIs) often requires an ordered series of method calls to achieve the desired behavior. For example, when developing GUI applications using the Java Swing API, similar orderings are common across the libraries. Device drivers in operating systems usually contain a large number of code clones, because all the drivers expose the same APIs and most of them implement similar logic. When developing a driver for a new hardware model, cloning an existing driver avoids the risk of changing the existing driver [1]. However, code clones can be troublesome during program maintenance.
If an error found in the original source code must be fixed, the code clones must also be fixed. However, the presence of code clones is usually not documented, so they must be manually detected and then fixed. Code clone detection, which automatically finds copied and pasted segments, is a key technology for resolving such maintenance problems. Many techniques to detect code clones have been proposed in the past. The sophisticated approaches among the proposed techniques require the source code to be parsed. Paper [2] states the following:

  Parsing the program suite of interest requires a parser for the language dialect of interest. While this is nominally an easy task, in practice one must acquire a tested grammar for the dialect of the language at hand.

This approach provides powerful code clone detection, but it requires a unique parser to be built for each target programming language. Other sophisticated approaches using parsers are likewise not appropriate for the detection of code clones if language independence is required. Paper [3] states the following:

  Language dependency is a big obstacle when it comes to the practical applicability of duplication detection. We have thus chosen to employ a technique that is as simple as possible and prove that it is effective in finding duplication.

Language independence is an important aspect of using code clone detection in real applications, because numerous programming languages are currently in use globally. The approach of [3] is based on simple string matching to achieve language independence, but it provides less powerful detection than the approaches that use parsers.

This paper describes another line-based approach that detects code clones using syntactic and lexical information, and it also aims to achieve programming language independence. A simple string matching algorithm is applied, but the input is customized and differs from that of other code clone detection tools. Source code is first converted into PALEX code, as proposed in a previous paper [4]. In addition, formatted code containing syntactic and lexical information is generated from the PALEX code. In this paper, the tool that detects code clones is called LePalex.

Section 2 describes related work dealing with code clone detection. Section 3 describes code clone detection and its input. Section 4 summarizes this paper.

2010 Seventh International Conference on Information Technology. 978-0-7695-3984-3/10 $26.00 (c) 2010 IEEE. DOI 10.1109/ITNG.2010.176

2. Related works

Since the 1990s, many studies have discussed code clone detection. These studies show that a significant percentage of source code contains code clones. A study [5] suggests that 19% of the total X Window System source code is cloned. Another paper [2] shows that the average occurrence of code clones across all subsystems is 12.7%. In an extreme case, the average occurrence of code clones is 59% [3].

Papers related to code clone detection typically categorize the techniques as follows:

- Line-based approach (e.g. [3, 5])
- Token-based approach (e.g. [6, 7])
- AST-based approach (e.g. [2])
- Dependency-based approach (e.g. [8])

The different techniques detect three types of code clones [9], as follows:

Type 1 is an exact copy. Some variations exist in the handling of white space characters and comments.

Type 2 is a syntactically identical copy, in which only user-defined identifiers, types, or the layout of the source code may be changed.

Type 3 is a copy with further modifications. Statements may be changed, added, or removed.

For example, the two segments (lines 4-7 and 11-14) in Figure 1 are textually identical, but the execution results are different. This indicates that the two segments are code clones, and they are categorized as Type 1.
In the line-based approach, all lines are compared one-for-one, which provides the advantage of programming language independence. This is an important issue in real applications. However, a drawback of this approach is that it ignores lexical and syntactic information in the source code. If a developer changes the preferences for the locations of braces and partially runs a code formatter tool, the line-based approach fails to detect the code clones.

In the token-based approach, the entire source code is scanned and a sequence of tokens is built; the tokens are then compared one-for-one. For example, if the token-based code clone detection tool CPD [7] reads the source code shown in Figure 1, it detects that the two segments are code clones and generates the output shown in Figure 2. Interestingly, the last token in the output is a brace signifying the end of the method. The brace is a syntactically meaningless token for code clone detection. This is a limitation because the approach ignores syntactic information in the source code.

 1: public class Calc {
 2:   int x=1;
 3:   void foo(){
 4:     for(int i = 0; i < 10; i++){
 5:       x = x + 2 * i + 1;
 6:     }
 7:     System.out.println(x);
 8:   }
 9:   void bar(){
10:     int x=1;
11:     for(int i = 0; i < 10; i++){
12:       x = x + 2 * i + 1;
13:     }
14:     System.out.println(x);
15:   }
16: }

Figure 1. Example of code clones

Found a 5 line (32 tokens) duplication in the following files:
Starting at line 4 of Calc.java
Starting at line 11 of Calc.java

for(int i=0; i < 10; i++){
    x = x + 2 * i + 1;
}
System.out.println(x);
}

Figure 2. Output of the code clone detection using CPD

Note: PALEX stands for PArsing actions and LExical information in Xml, and is generated by modified compilers.

CCFinder [6] preprocesses source code using syntactic information to fix this problem. In addition, it recognizes certain syntax patterns to detect a higher level of code clones. However, it is dependent on the target programming language.
If this technique must be applied to other programming languages, costly code porting is required.

In the AST-based approach, syntax-sensitive analysis detects code clones precisely. Generally, a compiler constructs an AST in the syntax analysis phase to represent the syntactic information in the source code. By modifying the compiler or building a syntax analyzer from scratch, the AST can be derived from the source code. The author and his colleagues are now developing a static analyzer containing Control Flow Graph (CFG) and Data Flow Graph (DFG) analysis for Java programs. The total size of the source code is more than one hundred thousand lines of code. We could obtain more precise results related to code clones if the CFG and DFG analysis were applied to detect them. However, if such code clone detection were applied to other programming languages, excessive development cost would be required.

All approaches have their advantages and disadvantages. This paper proposes a line-based approach, but the input is different from the original source code. Instead, the input is customized code generated from the PALEX source code representation. The tool that detects code clones using this approach is called LePalex. The tool is independent of programming languages and can detect code clones using syntactic and lexical information. In addition, it does not require excessive development cost.

3. Code clone detection

3.1. A line-based approach using the Karp-Rabin string matching algorithm

The detection tool, LePalex, uses the Karp-Rabin string matching algorithm [10]. This algorithm, which uses hashing to find substrings in text, is conceptually simple and easy to implement. LePalex takes a sequence of lines and calculates its hash value. Once the lines are inserted into a hash table, only the lines in the same entry need to be compared for code clone detection.
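The idea of hashing line sequences so that only candidates in the same hash-table entry are compared can be sketched as follows. This is not the authors' implementation: the class name, the per-line hash, and the constants BASE and MOD are choices made for this illustration, which buckets every k-line window of a file by a Karp-Rabin rolling hash.

```java
import java.util.*;

// Sketch (not LePalex itself): bucket every k-line window of a file by a
// Karp-Rabin rolling hash so that only windows sharing a bucket need a
// full line-by-line comparison afterwards.
public class KarpRabinLines {
    static final long BASE = 257, MOD = 1_000_000_007L;

    // Hash one line; any stable per-line hash works for the sketch.
    static long lineHash(String line) {
        return (line.hashCode() & 0xffffffffL) % MOD;
    }

    // Map from window hash to the start indices of windows with that hash.
    static Map<Long, List<Integer>> bucketWindows(List<String> lines, int k) {
        Map<Long, List<Integer>> buckets = new HashMap<>();
        int n = lines.size();
        if (n < k) return buckets;

        long[] h = new long[n];
        for (int i = 0; i < n; i++) h[i] = lineHash(lines.get(i));

        long pow = 1;                          // BASE^(k-1) mod MOD
        for (int i = 1; i < k; i++) pow = pow * BASE % MOD;

        long win = 0;                          // hash of the current window
        for (int i = 0; i < k; i++) win = (win * BASE + h[i]) % MOD;
        buckets.computeIfAbsent(win, x -> new ArrayList<>()).add(0);

        for (int i = k; i < n; i++) {          // roll the window one line down
            win = (win + MOD - h[i - k] * pow % MOD) % MOD;
            win = (win * BASE + h[i]) % MOD;
            buckets.computeIfAbsent(win, x -> new ArrayList<>()).add(i - k + 1);
        }
        return buckets;
    }
}
```

Because the hash rolls, hashing all windows costs time linear in the number of lines, and identical k-line segments are guaranteed to fall into the same bucket.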
After the PALEX source code representation is created, LePalex generates three types of code from the PALEX code. In this paper, they are referred to as first normal form, second normal form, and third normal form.

3.2. First normal form

The first normal form is source code in which indentation and redundant white spaces are stripped from the original source code. Figure 3 shows an example of the first normal form created from the source code in Figure 1. It is easy to convert PALEX code to the first normal form, because the PALEX code includes all the lexical information. If we want to detect Type 1 code clones and variations with different layouts of the source code, the first normal form is useful.

1:public class Calc{
2:int x=1;
3:void foo(){
4:for(int i=0;i<10;i++){
5:x=x+2*i+1;
6:}
7:System.out.println(x);
8:}
9:void bar(){
10:int x=1;
11:for(int i=0;i<10;i++){
12:x=x+2*i+1;
13:}
14:System.out.println(x);
15:}
16:}

Figure 3. Example of first normal form

3.3. Second normal form

The second normal form is generated code in which all identifiers are moved to the start of the line and the symbol $ is embedded at each identifier's original location. The moved identifiers are placed between the symbol $ and : at the start of each line. At the first character position of each line, $ signifies that the line contains identifiers, and a space character signifies that the line does not contain any identifiers. We can easily recover the first normal form from the second normal form if each $ is replaced with the original identifier and each space character in the first position is deleted.

1:$public,Calc:$ class ${
2:$int,x:$ $=1;
3:$void,foo:$ $(){
4:$int,i,i,i:for($ $=0;$<10;$++){
5:$x,x,i:$=$+2*$+1;
6: }
7:$System,out,println,x:$.$.$($);
8: }
9:$void,bar:$ $(){
10:$int,x:$ $=1;
11:$int,i,i,i:for($ $=0;$<10;$++){
12:$x,x,i:$=$+2*$+1;
13: }
14:$System,out,println,x:$.$.$($);
15: }
16: }

Figure 4. Example of second normal form
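The rewrite into the second normal form can be sketched as follows. This is not LePalex itself: in LePalex the classification of tokens comes from the lexical information in the PALEX code, whereas this sketch approximates it with a regular expression and a small keyword list chosen to reproduce the lines of Figure 4 (following the figure, type names such as int and modifiers such as public count as identifiers).

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the second-normal-form rewrite: each identifier token is
// replaced by '$' in place, and the identifiers themselves are collected
// between '$' and ':' at the start of the line.
public class SecondNormalForm {
    // Approximation: tokens kept as literal text.  In LePalex this
    // classification comes from the PALEX lexical information.
    static final Set<String> KEYWORDS =
        Set.of("class", "for", "while", "if", "else");
    static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String rewrite(String line) {
        List<String> ids = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        Matcher m = IDENT.matcher(line);
        int last = 0;
        while (m.find()) {
            if (KEYWORDS.contains(m.group())) continue;  // keep keywords as text
            body.append(line, last, m.start()).append('$');
            ids.add(m.group());
            last = m.end();
        }
        body.append(line.substring(last));
        // Column one: '$' marks a line with identifiers, a space marks none.
        return ids.isEmpty() ? " " + body
                             : "$" + String.join(",", ids) + ":" + body;
    }
}
```

Note that renaming an identifier changes only the list before the colon; the statement part after the colon stays the same, which is what makes Type 2 clones comparable by hashing.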
Figure 4 shows an example of the second normal form created from the source code in Figure 1. LePalex skips the identifiers between $ and : when it calculates hash values for the Karp-Rabin string matching algorithm. After all the lines in the source code are inserted into the hash table, each entry in the table is examined to detect code clones. In addition, LePalex finds continuous segments among the code clones and combines them into larger segments.

Even if only user-defined identifiers are changed, LePalex detects the code clones. For example, if the identifier x is replaced with y, line 12 is converted to

$y,y,i:$=$+2*$+1;

In this case, the statements (lines 5 and 12) after the identifier part between $ and : are the same. LePalex calculates the same hash values for the lines and checks the identifier pattern, determining that they are cloned lines.

3.4. Third normal form

The third normal form is a combination of the second normal form and syntactic information in the form of an XML document. Figure 5 shows an example of the third normal form composed from lines 1 and 2 in Figure 4. The third normal form preserves the full text of the second normal form, with the syntactic information embedded in it. Figure 5 resembles a sequence of XML tags. However, the second normal form can be obtained from the third normal form if all XML tags are stripped away. When LePalex calculates the hash value for matching, it skips the identifier part between $ and : and ignores all XML tags.

1:$public,Calc:$ <sft fr="16" to="55"/><rdc st="55" ln="1"/><rdc st="64" ln="1"/>class <rdc st="63" ln="1"/><sft fr="106" to="228"/>$<sft fr="365" to="347"/>{
2:$int,x:<rdc st="283" ln="2"/><sft fr="979" to="1015"/>$ <sft fr="1095" to="94"/><rdc st="94" ln="1"/><rdc st="111" ln="1"/>$<rdc st="238" ln="2"/><sft fr="1133" to="220"/>=<sft fr="220" to="356"/>1<sft fr="356" to="151"/><rdc st="151" ln="1"/><rdc st="188" ln="1"/><rdc st="170" ln="1"/>;

Figure 5. Example of third normal form
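The recovery and skipping rules just described can be sketched as follows; the tag syntax is taken from Figure 5, while the class and method names are invented for this illustration.

```java
import java.util.regex.*;

// Sketch: the second normal form is recovered from the third normal form
// by deleting every embedded XML tag, and hashing sees only what remains
// after the leading "$identifiers:" prefix.
public class ThirdNormalForm {
    static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Strip the embedded syntactic markup, leaving the second normal form.
    static String stripTags(String line) {
        return TAG.matcher(line).replaceAll("");
    }

    // The part of a line that contributes to the hash value: the leading
    // identifier list between '$' and ':' is skipped, as are all XML tags.
    static String hashable(String line) {
        String s = stripTags(line);
        if (s.startsWith("$")) {
            int colon = s.indexOf(':');
            if (colon >= 0) return s.substring(colon + 1);
        }
        return s.startsWith(" ") ? s.substring(1) : s;
    }
}
```

With this rule, corresponding lines of the first, second, and third normal forms all yield the same hashable text, so one hash table serves all three.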
After the entire source code is inserted into the hash table, each entry in the table is examined to detect code clones. In addition, LePalex finds continuous segments among the code clones and combines them into larger segments. Finally, it decomposes the code clones into complete syntactic segments using the XML tags.

4. Conclusion

This paper describes a line-based approach for the detection of code clones using syntactic and lexical information. A customized compiler writes both syntactic and lexical information. LePalex reads the information and converts it to three types of code: first normal form, second normal form, and third normal form. The first normal form is used to detect Type 1 code clones. The second normal form is used to detect Type 2 code clones. The third normal form is used to check for syntactically correct segments of code clones. To verify the validity of this approach, the source code of commercial products is currently being analyzed using it. The results of this analysis will be published in a future paper.

References

[1] Cory Kapser and Michael W. Godfrey, "Cloning Considered Harmful" Considered Harmful, Working Conference on Reverse Engineering, pp.19-28, 2006.

[2] Ira D. Baxter, Andrew Yahin, et al., Clone Detection Using Abstract Syntax Trees, International Conference on Software Maintenance, pp.368-377, 1998.

[3] Stéphane Ducasse, Matthias Rieger, and Serge Demeyer, A Language Independent Approach for Detecting Duplicated Code, 15th IEEE International Conference on Software Maintenance, pp.109-118, 1999.

[4] Kazuaki Maeda, XML-Based Source Code Representation with Parsing Actions, The International Conference on Software Engineering Research and Practice, pp.715-720, 2007.

[5] Brenda S. Baker, On finding duplication and near-duplication in large software systems, Working Conference on Reverse Engineering, pp.86-95, 1995.
[6] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code, IEEE Transactions on Software Engineering, vol.28, no.7, pp.654-670, 2002.

[7] PMD: Finding copied and pasted code, available from http://pmd.sourceforge.net/cpd.html.

[8] Jens Krinke, Identifying Similar Code with Program Dependence Graphs, Working Conference on Reverse Engineering, pp.301-309, 2001.

[9] Stefan Bellon, Rainer Koschke, et al., Comparison and Evaluation of Clone Detection Tools, IEEE Transactions on Software Engineering, vol.33, no.9, pp.577-591, 2007.

[10] Richard M. Karp and Michael O. Rabin, Efficient randomized pattern-matching algorithms, IBM Journal of Research and Development, vol.31, no.2, 1987.