CHAPTER 1
INTRODUCTION
1.1 General
Programming assignments are an intrinsic part of many courses. Such assignments
require students to produce solutions in different languages, including Java. Staff
members need to verify the assignments submitted by the students. Evaluating
these assignments takes considerable time and needs human intervention at every stage
of correction. Moreover, students sometimes submit a copy of another student's
work. Locating similarities in large sets of submissions is an arduous task.
A solution to this problem would be an automated plagiarism detection system that
narrows the set of all submissions down to a few suspicious files to be considered
carefully.
1.2 Motivation
Most available plagiarism detection software is built to detect content copied from the
web. Aside from stand-alone programs, there are also a number of online
applications that can aid detection. JPlag is one such system;
it requires users to submit their files online and allows them to
process a great deal of data. Each of these applications is implemented using only one
algorithm, which leaves weaknesses in detecting plagiarism.
This project delivers a tool for detecting plagiarism in student
software submissions. Four algorithms have been used in the design of this tool, thus
reducing the vulnerability in detecting plagiarism.
1.4 Requirements
CHAPTER 2
RELATED WORK
2.1 Literature Survey
Many tools are already available for plagiarism detection, implemented using
different algorithms. A summary of several algorithms is provided.
Common features of the different plagiarism detection algorithms are described.
Ethical and administrative issues involving detected plagiarism are discussed.
Programs that automatically grade student programs have been common for more
than a decade. A newer development is the automatic detection of plagiarism in
student programs. A number of algorithms to detect plagiarism can be found in the
technical literature. One simple algorithm proceeds as follows:
1) Remove all comments.
2) Ignore all blanks and extra lines, except when needed as delimiters.
3) Perform a character string comparison between the two files using the UNIX utilities diff,
grep, and wc.
4) Maintain a count of the percentage of characters which are the same. This measure
is called character correlation.
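The four steps above can be sketched in Java. This is a minimal illustration under stated assumptions, not the implementation from the literature: the class and method names are hypothetical, comments are stripped with simple regular expressions (which would mishandle comment markers inside string literals), all whitespace is removed rather than kept where it acts as a delimiter, and a position-by-position comparison stands in for the diff, grep and wc pipeline.

```java
// Hypothetical sketch of the character-correlation measure; names are assumptions.
class CharacterCorrelation {

    // Steps 1 and 2: remove block and line comments, then remove all whitespace.
    static String normalize(String source) {
        String noBlock = source.replaceAll("(?s)/\\*.*?\\*/", "");
        String noLine = noBlock.replaceAll("//[^\\n]*", "");
        return noLine.replaceAll("\\s+", "");
    }

    // Steps 3 and 4: compare position by position and report the percentage
    // of characters that are the same, relative to the longer text.
    static double correlation(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 100.0; // two empty programs are treated as identical
        }
        int same = 0;
        for (int i = 0; i < Math.min(x.length(), y.length()); i++) {
            if (x.charAt(i) == y.charAt(i)) {
                same++;
            }
        }
        return 100.0 * same / longest;
    }

    public static void main(String[] args) {
        String first = "int a = 1; // counter\n";
        String second = "/* copied */ int a=1;";
        System.out.println(correlation(first, second));
    }
}
```

A positional comparison is sensitive to insertions near the start of a file, which is one reason a diff-based comparison is used in step 3.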
CHAPTER 3
3.1 Architecture:
The following four sections outline the four key modules required to build
a fully working system: the processor, the algorithms, the graphical
user interface and the visualisations. Each module contains a large number of classes,
each building upon the others to produce the best results and adhere to
efficient coding practices. The modules are loosely coupled, so any changes
required in the communication protocols were relatively simple to
implement. Figure 1 outlines the architecture of these key components
and their relations.
3.2 Detailed Design:
3.2.1 Processor:
The processor is contained in a shared package with the algorithm classes and consists
of Java class files. Its primary function is to take a set of parameters when a
comparison request is initialised and to process all submission source files efficiently.
It acts as a central hub for the breakdown and comparison of files. To
achieve this, the processor must constantly communicate with the algorithms to
ensure all result sets are built and stored correctly. The processor classes must also
communicate constantly with the GUI to ensure that the user is kept updated on
current progress.
There are three key stages in the operation of the processor classes. The first
stage is to pre-process all the submitted source files into a format the application can
read. To achieve this, all submitted source files are read into the pre-comparator object
and modified appropriately. Once the pre-comparator has finished running against
an individual submission directory, a unique object is created for it. Following
completion of this stage, all the individual objects are stored collectively.
Stage two involves iterating over the object collection and passing the objects in pairs
to the root algorithm class. The final stage involves the storage of all collected data
for subsequent access by the GUI. This stage also involves the generation of groups.
Please note that the above three stages are key to any scan run by the user on a student
data set, but the processor is not limited to them. Classes such as load and
save and the system log are also available via the processor.
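The three stages can be pictured with a short Java sketch. All names here (ProcessorSketch, Submission, PairAlgorithm, runScan) are hypothetical and are not the project's classes. Pre-processing is reduced to whitespace removal, and scoring every ordered pair is an assumption made so that results may differ by direction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the processor's three stages; not the project's code.
class ProcessorSketch {

    // Stands in for the object created for each submission directory.
    static class Submission {
        final String owner;
        final String text;

        Submission(String owner, String text) {
            this.owner = owner;
            this.text = text;
        }
    }

    // Stands in for the root algorithm class: scores one pair of submissions.
    interface PairAlgorithm {
        double compare(Submission a, Submission b);
    }

    static Map<String, Double> runScan(Map<String, String> rawSources, PairAlgorithm algorithm) {
        // Stage 1: pre-process every submission and store the objects together.
        List<Submission> prepared = new ArrayList<>();
        for (Map.Entry<String, String> entry : rawSources.entrySet()) {
            String cleaned = entry.getValue().replaceAll("\\s+", "");
            prepared.add(new Submission(entry.getKey(), cleaned));
        }
        // Stage 2: pass the objects to the algorithm in pairs.
        // Stage 3: store every score for later access (here, a map keyed by pair).
        Map<String, Double> results = new LinkedHashMap<>();
        for (Submission a : prepared) {
            for (Submission b : prepared) {
                if (a != b) {
                    results.put(a.owner + "->" + b.owner, algorithm.compare(a, b));
                }
            }
        }
        return results;
    }
}
```

Keeping the algorithm behind a small interface mirrors the loose coupling described in the architecture section: the pairing loop does not need to know which comparison is being run.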
common changes made by possible plagiarists to disguise another solution as their
own. To remove these common methods of plagiarism, the pre-comparator iterates
over the student's submission directory. It then collects all .java submission files and
stores their contents in memory. Storing the file contents ensures that the original
file system is left intact. Only once the file collection is complete are the files checked
to see whether they are allowed to be used. A number of options are available to the
user to limit processing to specific filenames or to allow all files to be processed. These include
allowing or disallowing files linked from other documents via import, extends,
implements or other means.
It should be noted that a status flag is required to define the type of object. The flag
is one of four characters. An ‘S’ indicates a current student
submission, a ‘P’ indicates a previous submission, an ‘I’ states that the file is being
used for an individual comparison and that results should not be passed to the results class,
and an ‘N’ indicates that the student did not submit any files and that comparisons should not
take place.
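One way to model the four flags is a small Java enum. This is only a sketch: the enum and field names are hypothetical, and the behaviour assigned to ‘S’ and ‘P’ (compare and record) is an assumption, since the text only states the restrictions for ‘I’ and ‘N’.

```java
// Hypothetical enum for the four status flags; names are assumptions.
enum StatusFlag {
    CURRENT('S', true, true),         // current student submission
    PREVIOUS('P', true, true),        // previous submission (record = assumed)
    INDIVIDUAL('I', true, false),     // individual comparison, results not recorded
    NOT_SUBMITTED('N', false, false); // nothing submitted, no comparison

    final char code;
    final boolean compare;       // should comparisons take place?
    final boolean recordResults; // should results reach the results class?

    StatusFlag(char code, boolean compare, boolean recordResults) {
        this.code = code;
        this.compare = compare;
        this.recordResults = recordResults;
    }

    // Maps the single-character flag to its enum constant.
    static StatusFlag fromCode(char code) {
        for (StatusFlag flag : values()) {
            if (flag.code == code) {
                return flag;
            }
        }
        throw new IllegalArgumentException("Unknown status flag: " + code);
    }
}
```

An enum keeps the set of legal flags closed, so an unexpected character fails loudly instead of silently skipping a comparison.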
The most complex of the pre-comparator options was the replace identifiers method.
The replacement of identifiers is a common technique for detecting the most
basic form of plagiarism, that is, students changing variable names within their source
code. To ensure replacement was achieved accurately, the method is passed a list of
keywords. All lines are tokenized and each token is
compared to the given list. If no match is found, the token is
recognised as an identifier and removed from the string.
Although the pre-comparator appears simple, its replacement methods
are not, and they required a high degree of planning.
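The identifier-removal step described above can be sketched as follows. The class name, the method name and the short keyword list are hypothetical, and the delimiter set is an assumption; a real implementation would need the full Java keyword list and would have to handle string literals and comments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch of identifier removal; not the project's pre-comparator.
class IdentifierStripper {

    // Small illustrative subset of Java keywords (assumption, not the full list).
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "for", "if", "else", "while", "return",
            "class", "public", "static", "void", "new"));

    // Tokenizes a line and drops every word-like token that is not a keyword.
    static String stripIdentifiers(String line, Set<String> keywords) {
        StringBuilder out = new StringBuilder();
        // returnDelims = true keeps operators and spaces as one-character tokens.
        StringTokenizer tokens =
                new StringTokenizer(line, " \t=+-*/%;(){}[]<>,.!&|", true);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            boolean wordLike = Character.isJavaIdentifierStart(token.charAt(0));
            if (!wordLike || keywords.contains(token)) {
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

With this sketch, `int total = count + 1;` and `int sum = n + 1;` reduce to the same string, so renaming variables no longer hides the similarity.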
3.2.1.3. The comparator.
The comparator is a class key to the success of the project. Its sole purpose is simple yet
vital: it iterates over all the PlagData objects. If one of a pair of PlagData
objects carries the flag ‘N’, it informs the results class that no submission was entered;
otherwise, it passes the two current PlagData objects to the root algorithm class for
comparison.
The results from the algorithm comparisons are passed back through the comparator
to the results class in the form of an ArrayList for recording. Following each cycle of
recording information, the comparator updates the status panel and the progress bar in
the GUI. This in turn keeps the user updated on the progress of the current scan.
The x and y axes of the three-dimensional array each point to a specific user.
The z axis denotes which of the three algorithms is represented at an (x, y)
co-ordinate. The array is initialised to the correct number of PlagData objects upon
completion of the pre-comparator process and construction of the results class. All
cells in the 3D array are set to “0” at creation and populated following
successful completion of a comparison by the algorithm class. The comparator returns
an ArrayList following a successful comparison, and the results class interprets this by
taking the first element and putting it into the first position along the z axis. It then
repeats the same process for the second and third elements, putting them in their
corresponding array positions.
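The layout of the results array can be illustrated with a short sketch. The class and method names are hypothetical, and storing scores as double values is an assumption; the text does not state the element type.

```java
import java.util.List;

// Hypothetical sketch of the 3D results store; names and types are assumptions.
class ResultsSketch {

    // scores[x][y][z]: x and y identify users, z identifies the algorithm.
    private final double[][][] scores;

    ResultsSketch(int users, int algorithms) {
        // Java zero-fills new arrays, matching the "set to 0 at creation" rule.
        scores = new double[users][users][algorithms];
    }

    // Copies the list returned for one comparison into the z positions of (x, y).
    void record(int x, int y, List<Double> algorithmScores) {
        for (int z = 0; z < scores[x][y].length && z < algorithmScores.size(); z++) {
            scores[x][y][z] = algorithmScores.get(z);
        }
    }

    double get(int x, int y, int z) {
        return scores[x][y][z];
    }
}
```

For example, `record(0, 1, Arrays.asList(80.0, 65.0, 100.0))` would store the three algorithm scores for user 0 compared against user 1, leaving the cell for user 1 against user 0 untouched.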
The results class was created to act as a storage medium that can be read and
written from any class throughout the project. It follows a strict interface so that
only valid data can be written at any time. It is accessed from many classes, including
many GUI elements (such as the Graphs and the Stats tabs), the load and save features
(contained within the processor environment) and the comparator.
3.2.2 Graphical User Interface
The GUI comprises five key components, spread over six classes that combine
to create the finished layout. The GUI has been designed in this way not only to
break up the overall complexity but also to allow the various parts to be coded
individually (often by different developers). In addition, this allowed
components to be tested separately (Unit Testing) before they were fully integrated within
the GUI and tested using Integration Testing methods. The main frame itself holds
together the entire structure, and components are added to this frame separately. See
figure 2 for the basic layout. The separate components labeled in figure 2
are all created within separate classes. The menu bar and toolbar were designed this
way to allow interaction with other GUI components. The status panel was again
designed as a stand-alone component to allow interaction from other classes, such as
the processor and the algorithms. When the program loads, the only tab presented to
the user is the main options tab. Once a user has selected their chosen options and run
a scan, additional tabs are added to the tabbed pane.
3.2.2.3. Drop Target Panes
The drop target pane was designed in a very
methodical way. The design is both simple and robust, allowing other classes to
call the drop target pane and create more than one instance of the class. It was
designed in this way to allow more than one drop target pane in the review
panel. When a file is dragged into either of the drop panes, methods are called within
those separate instances, making the drag and drop facility as robust and as
easy to use as possible.
3.2.2.4. Groups
The Group class was implemented in much the same way as the drop target pane,
in that it allows more than one instance of itself to be created and used within
different parts of the GUI. The adaptability of such components allows for easy
integration into the GUI and reduces replicated code. For example, both the review tab
and the stats tab use an instance of the Group class. The groups are built around a tree
structure and are placed within a scroll pane for ease of use and hierarchical navigation.
3.2.3 Visualizations
All visualization classes are stored within the GUI package. The classes that
support visualization include the following:
- The RawDataTab is used to show the results from each of the algorithms.
- The ReviewTab is used to show a side-by-side comparison of the transformed files
with line highlighting.
- The GraphTab converts the raw data into several graphs to show the user any trends
identified.
The visualization also uses a freely available third-party package, JFreeChart,
which is used solely by the GraphTab.
3.2.3.1 Graphs.
Graphs are used in this application to display trends in the data. The JFreeChart
package was used to implement the graph features. The choice of JFreeChart was a
recommendation from the Java forum and suited the needs of the project. The
package has over two thousand styles of graph allowing a high degree of
customisation. We were therefore able to choose the most appropriate graphs.
The graphs within the Autoplag application obtain data in the form of a 3D array from
the results class. This array is then broken down into three 2D arrays which are then
used to populate the three individual histograms.
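Breaking the 3D array into per-algorithm 2D arrays can be sketched as below. The class and method names are hypothetical and the element type is assumed to be double; the sketch stops at producing the 2D data and does not call JFreeChart.

```java
// Hypothetical sketch: extract one algorithm's 2D slice from the 3D results.
class GraphDataSketch {

    // Returns the [x][y] matrix for algorithm index z.
    static double[][] sliceForAlgorithm(double[][][] results, int z) {
        double[][] slice = new double[results.length][];
        for (int x = 0; x < results.length; x++) {
            slice[x] = new double[results[x].length];
            for (int y = 0; y < results[x].length; y++) {
                slice[x][y] = results[x][y][z];
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        double[][][] results = new double[2][2][3];
        results[0][1] = new double[] {80.0, 65.0, 100.0};
        // One slice per algorithm; each would feed one histogram.
        for (int z = 0; z < 3; z++) {
            System.out.println(sliceForAlgorithm(results, z)[0][1]);
        }
    }
}
```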
3.3 Implementation
These algorithms are implemented in Java. The detailed Java
code for the algorithms specified above is given as follows:
3.3.2 Algorithm2: Common Words
3.3.3 Algorithm3: Common String
This algorithm can be considered level 3 plagiarism detection. Sometimes copying
can occur as shown below:
File 1:
int i =1;
int j=2;
k=i+j;
File 2:
int j=2;
int i=1;
k=j+i;
In the above example, the code of file 2 is generated by slightly transforming the code
of file 1. This algorithm can detect such transformations and
translations.
The entire file or program is divided into strings of data. Each string is obtained
by splitting the file at ‘;’. These strings are placed in an array. The other file
is similarly divided into strings to form a second array. Each string of one array is compared
with every string in the other array, and the number of matches is counted.
The ratio of the number of matches to the total number of strings gives the percentage of
plagiarism between the two files.
This algorithm can be implemented as follows:
Input: Two files
Output: Matching percentage
Method:
1. Read arrays st1, st2
2. match ← 0
3. n1 ← number of strings in st1
4. n2 ← number of strings in st2
5. for k := 0 to n1 - 1 do
Begin
6. for m := 0 to n2 - 1 do
Begin
7. if st1[k] = st2[m] then
8. match++
End
End
9. percentage ← (match / n1) × 100
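The common string method can be written as a short, runnable Java sketch. It is not the appendix listing: the class and method names are hypothetical, whitespace inside each string is removed before comparison (an assumption, so that `int i =1` and `int i=1` count as equal), and each string of the first file is counted at most once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the common string comparison; names are assumptions.
class CommonStringSketch {

    // Splits source text at ';' and strips whitespace from each piece.
    static List<String> statements(String source) {
        List<String> parts = new ArrayList<>();
        for (String piece : source.split(";")) {
            String cleaned = piece.replaceAll("\\s+", "");
            if (!cleaned.isEmpty()) {
                parts.add(cleaned);
            }
        }
        return parts;
    }

    // Percentage of strings in file1 that also occur anywhere in file2.
    static double matchPercentage(String file1, String file2) {
        List<String> st1 = statements(file1);
        List<String> st2 = statements(file2);
        if (st1.isEmpty()) {
            return 0.0;
        }
        int match = 0;
        for (String s : st1) {
            if (st2.contains(s)) {
                match++;
            }
        }
        return 100.0 * match / st1.size();
    }

    public static void main(String[] args) {
        String file1 = "int i =1;\nint j=2;\nk=i+j;";
        String file2 = "int j=2;\nint i=1;\nk=j+i;";
        System.out.println(matchPercentage(file1, file2));
    }
}
```

For the File 1 and File 2 example, the two reordered declarations match while `k=i+j` does not equal `k=j+i`, so this sketch finds two of the three strings in common. Catching the swapped operands would need an additional normalization step that is outside this sketch.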
CHAPTER 4
For the specified four algorithms with n input files, an n x n matrix is generated, as
shown in the following figures.
Figure 3
In the above figure, three files are compared such that each file is compared with the other
files and the matrix is generated. Figure 4 shows the result obtained in graphical
form. In figure 4 it is observed that file 1, when compared with file 3, gives a 100%
match, but when file 3 is compared with file 1 the percentage is below 100.
This indicates that file 3 contains everything in file 1 plus additional material, which
suggests that one of the files has been created using the other.
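The asymmetry in the matrix follows from dividing by the size of the file being checked. The sketch below illustrates this with made-up data; the class name, method names and the three tiny "files" are hypothetical and are not the files shown in the figures.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an n x n match matrix; data and names are assumptions.
class MatchMatrixSketch {

    // Percentage of units in 'from' that also occur in 'to'.
    static double percentContained(String[] from, String[] to) {
        if (from.length == 0) {
            return 0.0;
        }
        Set<String> other = new HashSet<>(Arrays.asList(to));
        int match = 0;
        for (String unit : from) {
            if (other.contains(unit)) {
                match++;
            }
        }
        return 100.0 * match / from.length;
    }

    // Cell [i][j] holds how much of file i is found in file j.
    static double[][] matrix(String[][] files) {
        int n = files.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                m[i][j] = percentContained(files[i], files[j]);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        String[][] files = {
            {"a", "b"},          // file 1
            {"x"},               // file 2
            {"a", "b", "c", "d"} // file 3: file 1 plus extra units
        };
        System.out.println(Arrays.deepToString(matrix(files)));
    }
}
```

Here every unit of file 1 appears in file 3, but only half of file 3 appears in file 1, which reproduces the pattern described for figure 4.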
Figure 4
Figure 5
Figure 6
Figure 5 is a snapshot of the result obtained when three files are compared with
each other using the common string algorithm. Figure 6 is the graph of that
result.
4.3 Result of common word algorithm
Figure 7
Figure 8
4.4 Result for Common Block algorithm
Figure 10
From figures 4, 6, 8 and 10 it is observed that the plotted graphs for the
same three files differ. This is due to the fact that the files have been plagiarized and some
changes have been made. Out of these four results, priority should be given to the common block
algorithm, since it identifies the plagiarized block of code.
CHAPTER 5
5.1 Conclusion:
The project is able to detect most of the plagiarism techniques used by students. The aims
and objectives have been met. All the proposed algorithms are implemented
successfully, but the GUI module is only partially implemented. The user should have a general
idea of all the algorithms, and the normalized result should be considered. The user should
fix an acceptance value depending on the assignment given and the
algorithm used; if the result for a file exceeds that value, the file is said to be
plagiarized.
REFERENCES
Journal Paper
[1] Alan Parker and James O. Hamblen, "Computer algorithms for plagiarism
detection", IEEE Transactions on Education, vol. 32, no. 2, May 1989.
[2] Colin J. Neill and Ganesh Shanmuganthan, "Web enabled plagiarism detection
tool", IEEE Computer Society, October 2004.
Books
[3] Herbert Schildt, "Java 2: The Complete Reference", 5th ed., McGraw-Hill, 2002.
[4] G. S. Baluja, "Algorithm Analysis", Dhanpat Rai Publications, 2003.
APPENDIX -I
CODE
/*
Compares two files and prints how much of the first (original) file matches the second.
option=1--->byte to byte
option=2---->word to word
option=3--->string to string
option=4--->block to block
*/
import java.io.*;
import java.util.*;
class bl2
{
    public static void main(String args[])throws Exception
    {
        // dirname1, dirname2 and the choice of option are never set in the
        // original listing; taking them from the command line is an assumption.
        if(args.length<2)
        {
            System.out.println("usage: java bl2 file1 file2 [option 1-4]");
            return;
        }
        String dirname1=args[0]; //Taking the first file as original file
        String dirname2=args[1];
        int option=args.length>2?Integer.parseInt(args[2]):1;
        String text1=readFile(dirname1);
        String text2=readFile(dirname2);
        List<String> st1,st2;
        int org=0; //number of units (bytes, words, strings, blocks) in the original file
        float match=0;
        switch(option)
        {
            case 1: //byte to byte: compare the files position by position
                org=text1.length();
                for(int j=0;j<text1.length()&&j<text2.length();j++)
                {
                    if(text1.charAt(j)==text2.charAt(j))
                        match++;
                }
                break;
            case 2: //word to word: units are separated by white space
                st1=split(text1," \t\r\n");
                st2=split(text2," \t\r\n");
                org=st1.size();
                match=countMatches(st1,st2);
                break;
            case 3: //string to string: units are separated by ';' or '.'
                st1=split(text1,";.");
                st2=split(text2,";.");
                org=st1.size();
                match=countMatches(st1,st2);
                break;
            case 4: //block to block: units run from a '{' up to the next '}'
                st1=blocks(text1);
                st2=blocks(text2);
                org=st1.size();
                match=countMatches(st1,st2);
                break;
            default:
                System.out.println("option must be 1, 2, 3 or 4");
                return;
        }
        System.out.println(match);
        if(org>0)
            match=(match/org)*100;
        System.out.println("Match percentage="+match);
    }

    //Reads a whole file, one byte at a time, into a string.
    static String readFile(String name)throws IOException
    {
        StringBuilder sb=new StringBuilder();
        FileInputStream f=new FileInputStream(name);
        try
        {
            int i;
            while((i=f.read())!=-1)
                sb.append((char)i);
        }
        finally
        {
            f.close();
        }
        return sb.toString();
    }

    //Splits text at any of the delimiter characters, dropping empty pieces.
    static List<String> split(String text,String delims)
    {
        List<String> parts=new ArrayList<String>();
        StringTokenizer t=new StringTokenizer(text,delims);
        while(t.hasMoreTokens())
        {
            String piece=t.nextToken().trim();
            if(piece.length()>0)
                parts.add(piece);
        }
        return parts;
    }

    //Collects the text from the most recent '{' up to each '}', as the original loops did.
    static List<String> blocks(String text)
    {
        List<String> parts=new ArrayList<String>();
        int start=0;
        for(int x=0;x<text.length();x++)
        {
            char c=text.charAt(x);
            if(c=='{')
                start=x;
            if(c=='}')
                parts.add(text.substring(start,x));
        }
        return parts;
    }

    //Counts every equal pair, as in the original nested loops, so repeated
    //units can push the percentage above 100.
    static int countMatches(List<String> st1,List<String> st2)
    {
        int match=0;
        for(String a:st1)
        {
            for(String b:st2)
            {
                if(a.equals(b))
                    match++;
            }
        }
        return match;
    }
}