Predicting Significant Datasets Using Decision Tree Techniques For Software Defect Analysis

IPASJ International Journal of Computer Science (IIJCS)
Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm

A Publisher for Research Motivation. Email: editoriijcs@ipasj.org
Volume 5, Issue 7, July 2017 ISSN 2321-5992
E. Suganthi¹, Dr. S. Prakasam²
¹ M.Phil. (Research Scholar), Department of Computer Science Application, SCSVMV University, Enathur
² Associate Professor, Department of Computer Science Application, SCSVMV University, Enathur
ABSTRACT
Data mining techniques can serve as underlying methods in software engineering, since they affect both preprocessing
and post-analysis. Software defect predictors help maintain the quality of software products effectively. This
research aims to establish a method for identifying software defects using preprocessing, clustering, and decision tree
methods. We use the NASA and PROMISE datasets as software metrics data. Decision trees are among the most popular
approaches for classification and prediction. They are generated from specific rules; a decision tree is a classifier
in a tree structure. Predicting which software modules are defective helps software developers use the available
datasets to deliver high-quality software products. Software defect prediction is the process of finding as many
defective software modules as possible without affecting overall performance. Collecting software engineering data
(NASA and PROMISE) and mining it has emerged as a successful approach to software defect prediction across multiple
datasets. The NASA and PROMISE datasets contain varied quantities of correlated instances and attributes, which are
useful for checking data integrity. Additionally, domain-specific expertise can be used to validate data integrity.
The significance of the results was tested via decision tree analysis performed after preprocessing and clustering.
Analyzing software defect predictions in a continuous and disciplined way brings several benefits, such as accurate
results on the datasets, a decision tree classification mechanism, and improved software prediction and process
quality.
Keywords: Software Engineering, Decision Tree, Preprocessing, Cluster, WEKA Tool, Software Defect Prediction,
Dataset.
1. INTRODUCTION
Software engineering generates a huge amount of data, and it is important to use it properly so that problems in the
software development cycle can be solved efficiently. In this paper, we focus on finding the best dataset for each
type of software engineering data [1], and on the specific data mining techniques that can solve those problems.
Classification can be described as a function that maps (classifies) a data item into one of several predefined classes
[3]. The goal is to induce a model that can be used to classify future data items whose class is unknown. In the
software development process, the performance of a classifier depends on the type and class of the data.
The software systems that we work with are inherently complex and difficult to conceptualize. This complexity leads to
faults [4][9] and defects and, as a result, increases the cost of software. Software metric datasets have long been a
standard tool for assessing the quality of software systems and the processes that produce them. Cluster analysis is a
group of multivariate techniques whose primary purpose is to group entities based on their attributes.
2. LITERATURE SURVEY
N. Fenton, M. Neil (1999) A Critique of Software Defect Prediction Models. Many organizations want to predict the
number of defects (faults) in software systems before they are deployed, to gauge the likely delivered quality and
maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a
correspondingly large literature. The authors provide a critical review of this literature and the state of the art.
Y. Freund, L. Mason (1999) The Alternating Decision Tree Learning Algorithm. An alternating decision tree consists
of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single
number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by
following all paths for which all decision nodes are true and summing any prediction nodes that are traversed.
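As an illustration of the classification rule just described, the following Python sketch scores an instance by summing the prediction values on every path whose decision conditions hold. The tree layout, the attribute names (`loc`, `complexity`), and the numeric values are hypothetical and chosen only for the example; they are not taken from Freund and Mason's paper or from any WEKA implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Instance = Dict[str, float]


@dataclass
class PredictionNode:
    """Holds a single number and any decision nodes hanging below it."""
    value: float
    splitters: List["DecisionNode"] = field(default_factory=list)


@dataclass
class DecisionNode:
    """Holds a predicate and one prediction node per outcome."""
    condition: Callable[[Instance], bool]
    if_true: PredictionNode
    if_false: PredictionNode


def score(node: PredictionNode, x: Instance) -> float:
    """Sum the prediction values on all paths that x satisfies."""
    total = node.value
    for decision in node.splitters:
        branch = decision.if_true if decision.condition(x) else decision.if_false
        total += score(branch, x)
    return total


def classify(root: PredictionNode, x: Instance) -> str:
    """The sign of the summed score gives the class (assumed labels)."""
    return "defective" if score(root, x) > 0 else "clean"


# Hypothetical tree: the root and all leaves are prediction nodes.
root = PredictionNode(-0.5, [
    DecisionNode(
        lambda x: x["loc"] > 100,
        PredictionNode(0.9, [
            DecisionNode(
                lambda x: x["complexity"] > 10,
                PredictionNode(0.4),
                PredictionNode(-0.2),
            )
        ]),
        PredictionNode(-0.3),
    )
])

print(classify(root, {"loc": 250, "complexity": 15}))
print(classify(root, {"loc": 50, "complexity": 15}))
```

The magnitude of the summed score can be read as a confidence in the prediction, which is one reason the sum is kept separate from the final sign test.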
Y. Brun (2003) Software fault identification via dynamic analysis and machine learning. This work proposes a technique
that identifies program properties that may indicate errors. The technique generates machine learning models of
run-time program properties known to expose faults, and applies these models to program properties of user-written
code to classify and rank properties that may lead the user to errors.
Henry, S., and Kafura, D. (1984) The Evaluation of Software Systems Structure Using Quantitative Software Metrics.
The design and analysis of the structure of software systems has typically been based on purely qualitative grounds.
The authors report on their positive experience with a set of quantitative measures of software structure. These
metrics are based on the number of possible paths of information flow through a given component.
H. L. Larsen (1999) An Approach to Flexible Information Access Systems Using Soft Computing. The author presents a
scheme for modeling expert-like flexibility in query answering by extending information bases with a knowledge-based
soft computing layer for query processing. The query answer is a subset of the most satisfying objects. An envelope
calibration method is proposed for fast retrieval of these objects from the information base.
J. MacQueen (1967) Some methods for classification and analysis of multivariate observations. The main purpose of this
paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The
process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of
within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$
is a partition of $E_N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then
$W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the method.
R. R. Papalkar and G. Chanidel (2013) Clustering in web text mining and its application in IEEE abstract classification.
Text mining, a branch of computer science, is the process of extracting patterns from large data sets by combining
methods from statistics and artificial intelligence with database management. Text mining is seen as an increasingly
important tool by modern business to transform data into business intelligence, giving an informational advantage.
Salleb, Ansaf and Christel Vrain (2000) An Application of Association Knowledge Discovery and Data Mining. The
rapidly emerging field of knowledge discovery in databases (KDD) has grown significantly in the past few years. This
growth is driven by a mix of daunting practical needs and strong research interest. The technology for computing and
storage has enabled people to collect and store information from a wide range of sources at rates that were, only a few
years ago, considered unimaginable.
3. METHODOLOGY & TOOLS
The WEKA tool supports many standard data mining tasks on software engineering datasets, such as data
preprocessing, classification, clustering, decision tree induction (J48), data visualization, and feature selection.
The basic premise is to use a computer application that can be trained on a dataset to perform clustering and
derive useful information in the form of trends and patterns.
In software defect analysis, prediction is the task of predicting continuous or ordered values for a given input.
However, as seen in [9], some classification techniques, such as preprocessing and clustering-based classification
rule algorithms, can be adapted for prediction to find the best dataset using the WEKA tool.
Decision Tree
In data mining with the WEKA tool, a decision tree is a decision support tool that uses a tree-like graph of decisions
and their possible consequences, including event outcomes, resource costs, integrity, and utility. A decision tree, or
classification tree (J48), is used to learn a classification function that infers the value of a dependent attribute
(variable) given the values of the independent (input) attributes (variables) of each instance. This is a problem
known as supervised classification, because the dependent attribute and the number of classes (values) are given [2].
Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to
more than one cluster. Clustering, or cluster analysis, involves assigning data points to clusters (also called
buckets, bins, or classes), or homogeneous classes, such that items in the same class or cluster are as similar as
possible, while items belonging to different classes are as dissimilar as possible. Clusters are identified via
similarity measures. These similarity measures include distance, connectivity, and intensity. Different similarity
measures may be chosen based on the data or the application.
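To make the idea of soft membership concrete, the sketch below computes fuzzy c-means style membership degrees of one point with respect to fixed cluster centers, using Euclidean distance as the dissimilarity measure. The fuzzifier `m = 2`, the two centers, and the sample point are assumptions for illustration; the paper does not state which soft clustering algorithm, if any, was run.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point: Sequence[float],
                centers: List[Sequence[float]],
                m: float = 2.0) -> List[float]:
    """Degree to which `point` belongs to each center; degrees sum to 1.

    Uses the fuzzy c-means membership formula with fuzzifier m > 1:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)).
    """
    dists = [euclidean(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


# Hypothetical centers and point (one software metric on each axis).
centers = [(0.0, 0.0), (4.0, 0.0)]
u = memberships((1.0, 0.0), centers)
print(u)  # membership degrees for the two clusters
```

A hard clustering is recovered by assigning each point to the cluster with the largest membership degree.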
4. IMPLEMENTATION
WEKA is a suite of tools for data mining that includes implementations of classification rules. It is a collection of
clustering and classification algorithms for data mining tasks, which can be applied directly to a dataset or called
from your own Java code. It provides tools for data preprocessing, classification, regression, clustering, association
rules, and visualization, and it can also be extended with new preprocessing schemes.
A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The
specific form and creation of this mapping, or model, differs from classifier to classifier; here it is built on the
NASA and PROMISE datasets. For example [14], the model of ZeroR (weka.classifiers.rules.ZeroR) consists of a single
value: the most common class, or the mean of all numeric values when predicting a numeric value (regression learning).
The performance of the learners on the MDP data was assessed using receiver operating characteristic (ROC) curves.
Formally, a defect predictor hunts for a signal that a software module is defect prone. Signal detection theory [17]
offers ROC curves as an analysis method for assessing different predictors. Any learning algorithm in WEKA is derived
from the abstract weka.classifiers.AbstractClassifier class [18]. This, in turn, implements weka.classifiers.Classifier.
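The ZeroR baseline is simple enough to sketch in a few lines. The Python class below is not WEKA's Java implementation; it only mirrors the behaviour that WEKA documents for ZeroR (the mode for a nominal class, the mean for a numeric class). The class name `ZeroRSketch` and the toy labels are hypothetical.

```python
from collections import Counter
from statistics import mean


class ZeroRSketch:
    """Baseline that ignores every input attribute."""

    def fit(self, targets):
        targets = list(targets)
        if not targets:
            raise ValueError("need at least one training target")
        numeric = all(
            isinstance(t, (int, float)) and not isinstance(t, bool)
            for t in targets
        )
        if numeric:
            # Regression case: predict the mean of the training targets.
            self.prediction_ = mean(targets)
        else:
            # Classification case: predict the most common class.
            self.prediction_ = Counter(targets).most_common(1)[0][0]
        return self

    def predict(self, instances):
        # The same value is returned for every instance.
        return [self.prediction_ for _ in instances]


# Hypothetical module labels: two clean modules, one defective.
baseline = ZeroRSketch().fit(["clean", "clean", "defective"])
print(baseline.predict([{"loc": 10}, {"loc": 900}]))
```

Any defect predictor worth using should beat this baseline, which is why it is a common first row in a comparison of learners.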
Top-down induction of decision trees (TDIDT, an old approach known from pattern recognition):
1. Select an attribute for the root node and create a branch for each possible attribute value.
2. Split the instances into subsets (one for each branch extending from the node).
3. Repeat the procedure recursively for each branch, using only the instances that reach the branch (those that
satisfy the conditions along the path from the root to the branch).
4. Stop if all instances have the same class.
ID3, C4.5, J48 (WEKA): select the attribute that minimizes the class entropy in the split.
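The attribute selection step can be sketched as follows. This Python fragment picks the nominal attribute whose split yields the lowest weighted class entropy, which is the ID3 criterion; C4.5 and J48 refine it with the gain ratio, which is not shown here. The module records and the attribute names (`size`, `reviewed`, `defect`) are made up for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(rows, attr, target):
    """Weighted class entropy after splitting `rows` on `attr`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def best_attribute(rows, attrs, target):
    """Attribute whose split leaves the least class entropy."""
    return min(attrs, key=lambda a: split_entropy(rows, a, target))


# Hypothetical module records.
rows = [
    {"size": "large", "reviewed": "no", "defect": "yes"},
    {"size": "large", "reviewed": "yes", "defect": "yes"},
    {"size": "small", "reviewed": "no", "defect": "no"},
    {"size": "small", "reviewed": "yes", "defect": "no"},
]

print(best_attribute(rows, ["size", "reviewed"], "defect"))
```

In this toy data, splitting on `size` separates the classes completely, while splitting on `reviewed` leaves each branch evenly mixed, so `size` would become the root node.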
Figure 1 J48 Decision tree classifier

Visualize MergeCurve:
Figure 2 Visualizing merge curve
Different classifiers can also be compared on one dataset by plotting metric curves, not just via accuracy,
correlation coefficient, etc. In the Explorer it is not possible to do that for several classifiers; this is only
possible in the Knowledge Flow.
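One metric curve suited to this kind of comparison is the ROC curve. The sketch below builds ROC points from classifier scores and computes the area under the curve with the trapezoidal rule. The two score lists stand in for two hypothetical classifiers; they are not results from the NASA or PROMISE data.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 for a defective module, 0 for a clean one.
    scores: higher means "more likely defective".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one module of each class")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Modules with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores from two classifiers on the same six modules.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.9, 0.8, 0.1, 0.7, 0.3, 0.2]

auc_a = auc(roc_points(labels, scores_a))
auc_b = auc(roc_points(labels, scores_b))
print(auc_a, auc_b)
```

Here the second classifier ranks every defective module above every clean one, so its curve encloses the larger area.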
The actual clustering for this algorithm is shown as one instance for each cluster representing the cluster centroid.
Figure 3 Cluster Analysis
Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects
or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups. The greater the likeness (or homogeneity) within a group, and the greater
the disparity between groups, the better or more distinct the clustering.
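A minimal k-means sketch shows how such groups, each summarized by a centroid, can be obtained and how within-group likeness can be measured. It is written in plain Python rather than with WEKA's clusterers, uses the first k points as initial centroids so that the run is reproducible, and operates on made-up two-metric module data.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until assignments settle."""
    centroids = [tuple(p) for p in points[:k]]  # assumed initialization
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: dist2(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid(members)
    return centroids, assignment


def within_cluster_ss(points, centroids, assignment):
    """Sum of squared distances to the assigned centroid (lower = tighter)."""
    return sum(dist2(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical modules described by two metrics each.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, 2)
print(centroids, assignment)
print(within_cluster_ss(points, centroids, assignment))
```

A small within-cluster sum of squares together with well-separated centroids corresponds to the high homogeneity and high disparity described above.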

Cluster Output:
Figure 4 Cluster Analysis Output

5. CONCLUSION
This research work has proposed a new approach for efficiently predicting the best dataset for experimental purposes.
The WEKA data mining tool was used to generate the modified J48 model classifiers. Experimental results have shown a
significant improvement over the existing J48 algorithm, and they indicate that the proposed algorithm classifies
accurately. The decision tree algorithm also generates rules in the classification process. These rules are used for
deciding which branches to select towards the leaf nodes in the tree. Finally, data mining is only as good as the
results it produces, so the quality and quantity of the available data and the computational cost determine the
success of data mining in the software development process with the WEKA tool.
FUTURE ENHANCEMENT
As future work, different clustering algorithms, or improved versions of the cluster and machine learning algorithms
used here, may be included in the experiments. The algorithms used in our evaluation experiments are the
simplest forms of some widely used methods. This model can also be applied to other risk assessment procedures, which
can be supplied as input to the system. These risk issues must have quantitative representations to be
considered as input for our system. We have recognized reasons why software engineering is a good fit for data
mining techniques, including the inherent complexity of development, the pitfalls of raw metrics, and the difficulties
of understanding software processes.
REFERENCES
[1] Q. Taylor and C. Giraud-Carrier, Applications of data mining in software engineering, Int. J. Data Analysis
Techniques and Strategies, 2010.
[2] Z. Li and M. Reformat, A practical method for the software fault prediction, in Proceedings of the IEEE
International Conference on Information Reuse and Integration (IRI), 2007.
[3] M. Shtern and V. Tzerpos, Clustering Methodologies for Software Engineering, Advances in Software Engineering,
2012.
[4] Y. Brun, Software fault identification via dynamic analysis and machine learning, Master's thesis, MIT Dept. of
EECS, Aug. 16, 2003.
[5] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proceedings of the 20th VLDB
Conference, Santiago, Chile, 1994.
[6] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, J. Machine Learning Research, vol. 7,
pp. 1-30, 2006.
[7] K. El-Emam, S. Benlarbi, N. Goel, S.N. Rai, Comparing Case-Based Reasoning Classifiers for Predicting
High-Risk Software Components, J. Systems and Software, vol. 55, no. 3, pp. 301-320, 2001.
[8] R. Agrawal, T. Imielinski, and A. N. Swami, Mining association rules between sets of items in large
databases, in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[9] H. Lu and B. Cukic, An adaptive approach with active learning in software fault prediction, in Proceedings of the
8th International Conference on Predictive Models in Software Engineering (PROMISE '12), pp. 79-88, New
York, NY, USA, ACM, 2012.
[10] S. Henry and D. Kafura, The Evaluation of Software Systems Structure Using Quantitative Software
Metrics, Software Practice and Experience, vol. 14, no. 6, pp. 561-573, 1984.
[11] Y. Brun and M. D. Ernst, Finding latent code errors via machine learning over program
executions, Proceedings of the 26th International Conference on Software Engineering, Edinburgh, Scotland, 2004.
[12] M. Elsner, E. Charniak, and M. Johnson, Structured generative models for unsupervised named-entity
clustering, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL '09), Stroudsburg, PA: Association
for Computational Linguistics, pp. 164-172, 2009.
[13] N. Fenton, P. Krause, and M. Neil, A probabilistic model for software defect prediction, IEEE Transactions on
Software Engineering, 2001.
[14] G. Boetticher, T. Menzies, and T. Ostrand, PROMISE Repository of empirical software engineering data,
http://promisedata.org/repository, West Virginia University, Department of Computer Science, 2007.
[15] N.E. Fenton and M. Neil, A critique of software defect prediction models, IEEE Transactions on Software
Engineering, 25(5), pp. 675-689, 1999.
[16] T. M. Khoshgoftaar and N. Seliya, Fault Prediction Modeling for Software Quality Estimation: Comparing
Commonly Used Techniques, Empirical Software Engineering, 8(3), pp. 255-283, 2003.
[17] T. Menzies, J. DiStefano, A. Orrego, and R. Chapman, Assessing Predictors of Software Defects, in Proceedings
of Workshop Predictive Software Models, 2004.
[18] T. Menzies, J. Greenwald, and A. Frank, Data mining static code attributes to learn defect predictors, IEEE
Transactions on Software Engineering, 33(1), pp. 2-13, 2007.
[19] J. Munson and T. M. Khoshgoftaar, The Detection of Fault-Prone Programs, IEEE Transactions on Software
Engineering, 18(5), pp. 423-433, 1992.
[20] F. Padberg, T. Ragg, and R. Schoknecht, Using machine learning for estimating the defect content after an
inspection, IEEE Transactions on Software Engineering, 30(1), pp. 17-28, 2004.
