
www.jntuworld.com

www.jwjobs.net

Expr.No:

Date :

INTRODUCTION
WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. WEKA is an open source application that is freely available under the GNU General Public License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform. It is user friendly, with a graphical interface that allows for quick setup and operation. WEKA operates on the assumption that the user data is available as a flat file or relation. This means that each data object is described by a fixed number of attributes, each usually of a specific type, normally nominal (alphanumeric) or numeric. WEKA gives novice users a tool for uncovering hidden information in databases and file systems through simple options and visual interfaces.

The WEKA workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. The original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (WEKA 3), for which development started in 1997, is now used in many different application areas, in particular for education and research.

ADVANTAGES OF WEKA
The obvious advantage of a package like WEKA is that a whole range of data preparation, feature selection and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use. Further advantages are:
1. Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
2. A comprehensive collection of data preprocessing and modeling techniques.
3. Ease of use due to its graphical user interfaces.

WEKA supports several standard data mining tasks, more specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection. All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally, numeric or nominal attributes, but some other attribute types are also supported).

S.K.T.R.M College off Engineering

WEKA provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query. It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using WEKA. Another important area that is not covered by the algorithms included in WEKA is sequence modeling.

Attribute-Relation File Format (ARFF) is the text file format used by WEKA to store data. An ARFF file contains two sections: the header and the data section. The first line of the header gives the relation name. Then comes the list of attributes (@attribute...). Each attribute is associated with a unique name and a type. The type describes the kind of data contained in the variable and what values it can have. The variable types are: numeric, nominal, string and date. The class attribute is by default the last one in the list. The header section can also contain comment lines, identified by a '%' at the beginning, which can describe the dataset content or give the reader information about the author. After the header comes the data itself (@data): each line stores the attribute values of a single instance, separated by commas.

WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of WEKA's machine learning algorithms on a collection of datasets.
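To make the ARFF layout concrete, here is a minimal Python sketch that reads a small ARFF text. It is not WEKA's own loader: the relation `weather_toy`, its attributes and its rows are hypothetical, and the parser assumes simple unquoted, comma-separated values only.

```python
# Minimal ARFF reader: a sketch for simple, unquoted files only.
# The sample relation below is hypothetical, invented for illustration.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign
@relation weather_toy

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

def parse_arff(text):
    """Return (relation_name, attributes, rows) for a simple ARFF string."""
    relation = None
    attributes = []  # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            # One instance per line, values separated by commas
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(SAMPLE_ARFF)
class_name = attributes[-1][0]  # by default the last attribute is the class
print(relation, len(attributes), len(rows), class_name)
```

Real ARFF files may also contain quoted strings, date formats and sparse rows, which this sketch deliberately ignores.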

Launching WEKA
The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the window are four buttons:
1. Simple CLI. Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.

If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause.

This User Manual focuses on using the Explorer but does not explain the individual data preprocessing tools and learning algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the book Data Mining (Witten and Frank, 2005).

The WEKA Explorer


Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a data set before starting to explore the data. The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the WEKA bird) stays visible regardless of which section you are in.

Status Box
The status box appears at the very bottom of the window. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:
1. Available memory. Display in the log box the amount of memory available to WEKA.
2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

Log Button

Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened.

WEKA Status Icon


To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down and takes a nap. The number beside the symbol gives the number of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird gets up and starts moving around. If it's standing but stops moving for a long time, it's sick: something has gone wrong! In that case you should restart the WEKA Explorer.

1. Preprocessing
Opening files
The first three buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)

Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

The Current Relation


Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the current relation is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:
1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.

Working with Attributes


Below the Current relation box is a box titled Attributes. There are three buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:

1. No. A number that identifies the attribute in the order in which the attributes are specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file.

When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:
1. Name. The name of the attribute, the same as that given in the attribute list.
2. Type. The type of attribute, most commonly Nominal or Numeric.
3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).
4. Distinct. The number of different values that the data contains for this attribute.
5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.
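The Missing, Distinct and Unique figures can be reproduced by hand for a single attribute. The following Python sketch shows one way to do so; the function name `attribute_stats`, the sample column, and the percentage base (all instances) are assumptions made for illustration, and "?" is used as the missing-value marker, as in ARFF files.

```python
from collections import Counter

def attribute_stats(values, missing_marker="?"):
    """Count missing, distinct and unique values for one attribute column."""
    present = [v for v in values if v != missing_marker]
    counts = Counter(present)
    total = len(values)
    missing = total - len(present)
    # A value is "unique" when exactly one instance carries it
    unique = sum(1 for count in counts.values() if count == 1)
    return {
        "missing": missing,
        "missing_pct": round(100.0 * missing / total) if total else 0,
        "distinct": len(counts),
        "unique": unique,
        "unique_pct": round(100.0 * unique / total) if total else 0,
    }

# Hypothetical nominal column with one missing entry
column = ["sunny", "rainy", "sunny", "?", "overcast", "rainy", "sunny"]
stats = attribute_stats(column)
print(stats)
```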

Below these statistics is a list showing more information about the values stored in this attribute, which differs depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean and standard deviation. Below these statistics there is a colored histogram, color-coded according to the attribute chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available selections when clicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.

Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The three buttons above can also be used to change the selection:
1. All. All boxes are ticked.
2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.

Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Save button in the top-right corner of the Preprocess panel.

Working with Filters


The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box brings up a GenericObjectEditor dialog box.
The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options. Clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window.

Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save..., allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.
Applying Filters

Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in the same formats available for loading data, allowing it to be kept for future use.

Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the supervised filters require a class attribute to be set, and some of the unsupervised attribute filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.

2. Classification
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data.

Further testing options can be set by clicking on the More options... button:
1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!
7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.
8. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.
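The cross-validation and percentage split modes, together with the random seed option, can be illustrated with a short Python sketch that only divides instance indices. It is not WEKA's implementation: the function names are hypothetical, the percentage is assumed to be the share used for training, and no stratification by class is attempted.

```python
import random

def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle instance indices with a seed, then cut into train and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]

def cross_validation_folds(n_instances, n_folds, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

train, test = percentage_split(10, 70)
folds = list(cross_validation_folds(10, 5))
print(len(train), len(test), len(folds))
```

Because the data is shuffled before it is divided, the position of an instance inside a fold no longer matches its position in the original file, which is why instance numbers in cross-validation output do not correspond to the location in the data.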

The Class Attribute


The classifiers in WEKA are designed to be trained to predict a single class attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both.

By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier
Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button.

When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

The Classifier Output Text


The text in the Classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:
1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.
3. The results of the chosen test mode are broken down thus:
4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.
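The row/column convention of the confusion matrix (actual class is the row, predicted class is the column) can be sketched in a few lines of Python. The labels and the two prediction lists below are hypothetical values chosen only to show the counting.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a matrix where row = actual class and column = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical actual and predicted classes for six test instances
labels = ["yes", "no"]
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

matrix = confusion_matrix(actual, predicted, labels)
# Correctly classified instances lie on the diagonal
correct = sum(matrix[i][i] for i in range(len(labels)))
print(matrix, correct)
```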

The Result List

After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:
1. View in main window. Shows the output in the main window (just like left-clicking the entry).
2. View in separate window. Opens a new independent window for viewing the results.
3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.
4. Load model. Loads a pre-trained model object from a binary file.
5. Save model. Saves a model object to a binary file. Objects are saved in Java serialized object form.
6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on the data set that has been specified with the Set... button under the Supplied test set option.
7. Visualize classifier errors. Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.
8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier model, if possible (i.e. for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.
9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.
10. Visualize threshold curve. Generates a plot illustrating the tradeoffs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of positive must be greater than 0.5 for the instance to be predicted as positive. The plot can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs false positive rate), and for other types of curves.
11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).

Options are greyed out if they do not apply to the specific set of results.
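The margin and threshold ideas behind these two curve options can be worked through numerically. The Python sketch below uses hypothetical probabilities and function names; it computes the margin for one instance and a single (true positive rate, false positive rate) point for a given threshold, which is one point of an ROC curve.

```python
def margin(class_probs, actual):
    """Probability of the actual class minus the highest probability among the others."""
    p_actual = class_probs[actual]
    p_best_other = max(p for label, p in class_probs.items() if label != actual)
    return p_actual - p_best_other

def threshold_point(scores, actual_positive, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for score, is_positive in zip(scores, actual_positive):
        predicted_positive = score > threshold  # strictly greater, as in the text
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hypothetical class distribution predicted for one instance
probs = {"a": 0.75, "b": 0.125, "c": 0.125}
m = margin(probs, "a")

# Hypothetical positive-class scores and true labels for four instances
scores = [0.9, 0.6, 0.4, 0.2]
actual_positive = [True, False, True, False]
curve = [threshold_point(scores, actual_positive, t) for t in (0.3, 0.5, 0.7)]
print(m, curve)
```

A negative margin means the classifier preferred some other class over the actual one; sweeping the threshold over many values traces out the full curve.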

3. Clustering
Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (Section 2), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem it may be helpful to disable this option.
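The idea behind Classes to clusters evaluation can be sketched as follows. This is a simplified illustration, not WEKA's procedure: it assumes each cluster is simply labelled with its majority class, and the cluster assignments and class labels are hypothetical.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and count mismatching instances."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    # Majority vote per cluster (a simplifying assumption)
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    errors = sum(
        1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label
    )
    return mapping, errors

# Hypothetical cluster assignments and pre-assigned classes for six instances
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "a"]
mapping, errors = classes_to_clusters(cluster_ids, class_labels)
print(mapping, errors)
```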

Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is greyed out when it is not applicable.

4. Associating
Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

Learning Associations
Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

5. Selecting Attributes
Searching and Evaluating
Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.
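The split between an evaluator and a search method can be illustrated with a toy Python sketch. Both parts are assumptions made for illustration and do not correspond to any particular WEKA evaluator or search class: the evaluator scores a subset by how often the majority class within each combination of attribute values is correct on the training data, and the search simply tries every subset.

```python
from collections import Counter, defaultdict
from itertools import combinations

def subset_worth(rows, classes, subset):
    """Toy evaluator: training accuracy of a majority-class lookup on the subset."""
    groups = defaultdict(Counter)
    for row, cls in zip(rows, classes):
        key = tuple(row[i] for i in subset)
        groups[key][cls] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)

def exhaustive_search(rows, classes, n_attributes):
    """Toy search method: try every subset, smallest first."""
    best_subset = ()
    best_worth = subset_worth(rows, classes, ())
    for size in range(1, n_attributes + 1):
        for subset in combinations(range(n_attributes), size):
            worth = subset_worth(rows, classes, subset)
            if worth > best_worth:  # strict comparison: smaller subsets win ties
                best_subset, best_worth = subset, worth
    return best_subset, best_worth

# Hypothetical data: the class depends only on the attribute at index 1
rows = [("s", "hot", "x"), ("s", "cold", "y"), ("r", "hot", "y"), ("r", "cold", "x")]
classes = ["no", "yes", "no", "yes"]
best_subset, best_worth = exhaustive_search(rows, classes, 3)
print(best_subset, best_worth)
```

Exhaustive search grows exponentially with the number of attributes, which is why practical search methods explore only part of the subset space.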

Options
The Attribute Selection Mode box has two options:
1. Use full training set. The worth of the attribute subset is determined using the full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (Section 2), there is a drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection
Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the classify panel. It is also possible to Visualize reduced data, or if you have used an attribute transformer such as PrincipalComponents, Visualize transformed data.

6. Visualizing
WEKA's visualization section allows you to visualize 2D plots of the current relation.

The scatter plot matrix


When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, color coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to color the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.

Selecting an individual 2D scatter plot


When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example classifier errors; the same visualization controls are used here.)

Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The X and Y written beside the strips show what the current axes are (B is used for both X and Y).

Above the attribute strips is a slider labelled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.
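The effect of jitter can be shown with a small Python sketch. The function name, the uniform displacement and the sample points are assumptions for illustration; WEKA's own jitter calculation may differ.

```python
import random

def jitter(points, amount, seed=0):
    """Add a small random displacement to every (x, y) point."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# Five hypothetical instances that all fall on exactly the same spot
points = [(1.0, 2.0)] * 5
jittered = jitter(points, amount=0.1)

# Before jitter the points are indistinguishable; afterwards they spread out
print(len(set(points)), len(set(jittered)))
```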

Selecting Instances
There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.)

Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:
1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.
4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).

Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph.

Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.

WEKA (Waikato Environment for Knowledge Analysis)

WEKA supports the following file formats:
1. ARFF (Attribute-Relation File Format)
2. CSV (Comma Separated Values)
3. C4.5
4. Binary files (True/False or Yes/No, Buys TV or Not)

Confusion Matrix
How well your classifier can recognize tuples of different classes.

                    Predicted Class
                    C1                 C2
Actual Class   C1   True Positives     False Negatives
               C2   False Positives    True Negatives

True Positives: the positive tuples that were correctly labeled by the classifier.
True Negatives: the negative tuples that were correctly labeled by the classifier.
False Positives: the negative tuples that were incorrectly labeled by the classifier.
False Negatives: the positive tuples that were incorrectly labeled by the classifier.
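The four counts defined above can be tallied directly from a list of actual and predicted labels. The following is a minimal sketch in Python, not WEKA's own code; the label lists are hypothetical and the function name `confusion_counts` is an assumption made for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # positive tuple, correctly labeled
        elif a == positive:
            fn += 1          # positive tuple, incorrectly labeled
        elif p == positive:
            fp += 1          # negative tuple, incorrectly labeled
        else:
            tn += 1          # negative tuple, correctly labeled
    return tp, fn, fp, tn


# Hypothetical labels, for illustration only.
actual = ["C1", "C1", "C1", "C2", "C2", "C2"]
predicted = ["C1", "C2", "C1", "C2", "C1", "C2"]

tp, fn, fp, tn = confusion_counts(actual, predicted, positive="C1")
accuracy = (tp + tn) / len(actual)
print(tp, fn, fp, tn, accuracy)
```

The accuracy reported as "Correctly Classified Instances" corresponds to (TP + TN) divided by the total number of tuples.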


1. Convert the following relation into CSV file format.
a. Write chart for result & analyze the result
b. Perform evaluation on training set
c. Write summary and confusion matrix and justify it.

sno  NAME             CD  MEFA  NS  OOAD  VLSI  WT  sum  avg    Result
1    MANASA           71  77    75  66    60    61  552  73.60  PASS
2    RAGHU VARDHAN    64  49    61  64    78    79  539  71.87  PASS
3    PRIYANKA         75  72    77  72    82    65  590  78.67  PASS
4    ZAKIA            64  47    36  18    37    39  381  50.80  Fail
5    NAVEEN           32  68    49  45    32    56  426  56.80  Fail
6    HARITHA          79  57    86  72    68    87  585  78.00  PASS
7    PURUSHOTHAM      54  71    66  58    70    59  522  69.60  PASS
8    BHANU PRAKASH    60  37    61  63    64    44  473  63.07  Fail
9    ABHILASH         12  45    11  7     26    27  259  34.53  Fail
10   SUDHEER          69  68    68  52    58    57  510  68.00  PASS
11   PARUSHA RAMULU   56  55    47  48    50    58  451  60.13  PASS
12   CHENNAKESHAVULU  87  52    81  72    90    88  618  82.40  PASS
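As a rough illustration of the conversion asked for in this exercise, the sketch below writes a few rows of the relation as CSV text using Python's standard `csv` module. Only three rows and the six subject columns are included, and the helper name `to_csv` is an assumption; the resulting text could be saved with a `.csv` extension and opened in WEKA.

```python
import csv
import io

header = ["sno", "NAME", "CD", "MEFA", "NS", "OOAD", "VLSI", "WT", "Result"]

# Three rows copied from the relation above (rows 1, 2 and 4).
rows = [
    [1, "MANASA", 71, 77, 75, 66, 60, 61, "PASS"],
    [2, "RAGHU VARDHAN", 64, 49, 61, 64, 78, 79, "PASS"],
    [4, "ZAKIA", 64, 47, 36, 18, 37, 39, "Fail"],
]


def to_csv(header, rows):
    """Return the relation as CSV text: one header line, one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv(header, rows)
print(csv_text)
```

The first line holds the attribute names, which WEKA's CSV loader uses as the attribute names of the resulting relation.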


Working with ARFF files


2. Convert the following relation into ARFF format. Analyze it, write the confusion matrix and justify it.

Eno  Ename       Salary  deptname
1    Pranav      45000   CSE
2    Karthika    35000   ECE
3    Ishaanth    35000   EEE
4    Mithrika    50000   CSE
5    Sriharsha   45000   ECE
6    Rishanth    50000   CSE
7    Sushanth    45000   EEE
8    Sai Rudhra  35000   ECE
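One way to produce the ARFF text for this relation is sketched below in Python. The helper names `quote` and `to_arff`, the relation name `employee`, and the choice of attribute types (numeric, string, nominal) are assumptions for illustration; other type choices are possible.

```python
def quote(value):
    """Wrap values containing spaces in single quotes, as ARFF requires."""
    text = str(value)
    return f"'{text}'" if " " in text else text


def to_arff(relation, attributes, rows):
    """attributes: list of (name, type); type is a string such as 'numeric'
    or 'string', or a list of nominal values."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("Eno", "numeric"),
    ("Ename", "string"),
    ("Salary", "numeric"),
    ("deptname", ["CSE", "ECE", "EEE"]),
]
rows = [
    (1, "Pranav", 45000, "CSE"),
    (2, "Karthika", 35000, "ECE"),
    (3, "Ishaanth", 35000, "EEE"),
    (4, "Mithrika", 50000, "CSE"),
    (5, "Sriharsha", 45000, "ECE"),
    (6, "Rishanth", 50000, "CSE"),
    (7, "Sushanth", 45000, "EEE"),
    (8, "Sai Rudhra", 35000, "ECE"),
]

arff_text = to_arff("employee", attributes, rows)
print(arff_text)
```

Declaring deptname as a nominal attribute lets it serve as the class attribute for classification.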


3. Convert the following relation into ARFF format.
a. Analyze it
b. Classify records using J48 and generate Decision Tree
c. Write confusion matrix and justify it.

Outlook   Temperature  Humidity  Windy  Play
Sunny     45           15        False  Yes
Overcast  20           30        False  Yes
Rainy     30           20        True   No
Sunny     50           10        True   No
Sunny     48           15        True   Yes
Overcast  28           36        False  Yes
Rainy     35           25        False  No


4. Convert the following transaction database into ARFF format.

Transaction id  Items Bought
T1              I1,I2,I5
T2              I2,I4
T3              I2,I3
T4              I1,I2,I4
T5              I1,I3
T6              I2,I3
T7              I1,I3
T8              I1,I2,I3,I5
T9              I1,I2,I3
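A transaction database has a variable number of items per row, whereas ARFF needs a fixed set of attributes. One possible encoding, sketched below in Python, uses one nominal attribute per item with the values yes/no. The relation name `transactions`, the yes/no value set and the function name are assumptions; other encodings (for example marking absent items as missing) are also used.

```python
transactions = {
    "T1": ["I1", "I2", "I5"],
    "T2": ["I2", "I4"],
    "T3": ["I2", "I3"],
    "T4": ["I1", "I2", "I4"],
    "T5": ["I1", "I3"],
    "T6": ["I2", "I3"],
    "T7": ["I1", "I3"],
    "T8": ["I1", "I2", "I3", "I5"],
    "T9": ["I1", "I2", "I3"],
}


def transactions_to_arff(relation, transactions):
    """Encode each transaction as one row with a yes/no flag per item."""
    items = sorted({item for bought in transactions.values() for item in bought})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {item} {{yes,no}}" for item in items]
    lines += ["", "@data"]
    for bought in transactions.values():
        lines.append(",".join("yes" if item in bought else "no" for item in items))
    return "\n".join(lines) + "\n"


arff_text = transactions_to_arff("transactions", transactions)
print(arff_text)
```

Each data row then has exactly five values (I1 to I5), which is the flat-file shape WEKA expects.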


5. Convert the following transaction database into ARFF format.

Transaction id  Items Bought
1               {M, O, N, K, E, Y}
2               {D, O, N, K, E, Y}
3               {M, A, K, E}
4               {M, U, C, K, Y}
5               {C, O, O, K, I, E}


Description of the German credit dataset in ARFF (Attribute-Relation File Format):
Structure of ARFF Format:
% comment lines
@relation relation name
@attribute attribute name
@data
Set of data items separated by commas.

% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut f"ur Statistik und "Okonometrie
% Universit"at Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. the original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
%    Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
%   Status of existing checking account
%   A11 : ... < 0 DM
%   A12 : 0 <= ... < 200 DM
%   A13 : ... >= 200 DM /
%         salary assignments for at least 1 year
%   A14 : no checking account
%
% Attribute 2: (numerical)
%   Duration in month
%
% Attribute 3: (qualitative)
%   Credit history
%   A30 : no credits taken/
%         all credits paid back duly
%   A31 : all credits at this bank paid back duly
%   A32 : existing credits paid back duly till now
%   A33 : delay in paying off in the past
%   A34 : critical account/
%         other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
%   Purpose
%   A40 : car (new)
%   A41 : car (used)
%   A42 : furniture/equipment
%   A43 : radio/television
%   A44 : domestic appliances
%   A45 : repairs
%   A46 : education
%   A47 : (vacation - does not exist?)
%   A48 : retraining
%   A49 : business
%   A410 : others
%
% Attribute 5: (numerical)
%   Credit amount
%
% Attribute 6: (qualitative)
%   Savings account/bonds
%   A61 : ... < 100 DM
%   A62 : 100 <= ... < 500 DM
%   A63 : 500 <= ... < 1000 DM
%   A64 : .. >= 1000 DM
%   A65 : unknown/ no savings account
%
% Attribute 7: (qualitative)
%   Present employment since
%   A71 : unemployed
%   A72 : ... < 1 year
%   A73 : 1 <= ... < 4 years
%   A74 : 4 <= ... < 7 years
%   A75 : .. >= 7 years
%
% Attribute 8: (numerical)
%   Installment rate in percentage of disposable income
%
% Attribute 9: (qualitative)
%   Personal status and sex
%   A91 : male : divorced/separated
%   A92 : female : divorced/separated/married
%   A93 : male : single
%   A94 : male : married/widowed
%   A95 : female : single
%
% Attribute 10: (qualitative)
%   Other debtors / guarantors
%   A101 : none
%   A102 : co-applicant
%   A103 : guarantor
%

% Attribute 11: (numerical)
%   Present residence since
%
% Attribute 12: (qualitative)
%   Property
%   A121 : real estate
%   A122 : if not A121 : building society savings agreement/
%          life insurance
%   A123 : if not A121/A122 : car or other, not in attribute 6
%   A124 : unknown / no property
%
% Attribute 13: (numerical)
%   Age in years
%
% Attribute 14: (qualitative)
%   Other installment plans
%   A141 : bank
%   A142 : stores
%   A143 : none
%
% Attribute 15: (qualitative)
%   Housing
%   A151 : rent
%   A152 : own
%   A153 : for free
%
% Attribute 16: (numerical)
%   Number of existing credits at this bank
%
% Attribute 17: (qualitative)
%   Job
%   A171 : unemployed/ unskilled - non-resident
%   A172 : unskilled - resident
%   A173 : skilled employee / official
%   A174 : management/ self-employed/
%          highly qualified employee/ officer
%
% Attribute 18: (numerical)
%   Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
%   Telephone
%   A191 : none
%   A192 : yes, registered under the customers name
%
% Attribute 20: (qualitative)
%   foreign worker
%   A201 : yes
%   A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)
%
%

%      1        2
% ----------------------------
%  1   0        1
% -----------------------
%  2   5        0
%
% (1 = Good, 2 = Bad)
%
% the rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
%
% Relabeled values in attribute checking_status
%   From: A11 To: '<0'
%   From: A12 To: '0<=X<200'
%   From: A13 To: '>=200'
%   From: A14 To: 'no checking'
%
%
% Relabeled values in attribute credit_history
%   From: A30 To: 'no credits/all paid'
%   From: A31 To: 'all paid'
%   From: A32 To: 'existing paid'
%   From: A33 To: 'delayed previously'
%   From: A34 To: 'critical/other existing credit'
%
%
% Relabeled values in attribute purpose
%   From: A40 To: 'new car'
%   From: A41 To: 'used car'
%   From: A42 To: furniture/equipment
%   From: A43 To: radio/tv
%   From: A44 To: 'domestic appliance'
%   From: A45 To: repairs
%   From: A46 To: education
%   From: A47 To: vacation
%   From: A48 To: retraining
%   From: A49 To: business
%   From: A410 To: other
%
%
% Relabeled values in attribute savings_status
%   From: A61 To: '<100'
%   From: A62 To: '100<=X<500'
%   From: A63 To: '500<=X<1000'
%   From: A64 To: '>=1000'
%   From: A65 To: 'no known savings'
%
%
% Relabeled values in attribute employment
%   From: A71 To: unemployed
%   From: A72 To: '<1'

%   From: A73 To: '1<=X<4'
%   From: A74 To: '4<=X<7'
%   From: A75 To: '>=7'
%
%
% Relabeled values in attribute personal_status
%   From: A91 To: 'male div/sep'
%   From: A92 To: 'female div/dep/mar'
%   From: A93 To: 'male single'
%   From: A94 To: 'male mar/wid'
%   From: A95 To: 'female single'
%
%
% Relabeled values in attribute other_parties
%   From: A101 To: none
%   From: A102 To: 'co applicant'
%   From: A103 To: guarantor
%
%
% Relabeled values in attribute property_magnitude
%   From: A121 To: 'real estate'
%   From: A122 To: 'life insurance'
%   From: A123 To: car
%   From: A124 To: 'no known property'
%
%
% Relabeled values in attribute other_payment_plans
%   From: A141 To: bank
%   From: A142 To: stores
%   From: A143 To: none
%
%
% Relabeled values in attribute housing
%   From: A151 To: rent
%   From: A152 To: own
%   From: A153 To: 'for free'
%
%
% Relabeled values in attribute job
%   From: A171 To: 'unemp/unskilled non res'
%   From: A172 To: 'unskilled resident'
%   From: A173 To: skilled
%   From: A174 To: 'high qualif/self emp/mgmt'
%
%
% Relabeled values in attribute own_telephone
%   From: A191 To: none
%   From: A192 To: yes
%
%
% Relabeled values in attribute foreign_worker
%   From: A201 To: yes
%   From: A202 To: no
%
%

% Relabeled values in attribute class
%   From: 1 To: good
%   From: 2 To: bad
%
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good


Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment:
Total valid attributes
Categorical or nominal attributes (which take values such as true/false)
Real-valued attributes

1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
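The split into nominal and real-valued attributes can be read off the @attribute lines of the ARFF header: a declaration followed by a brace-enclosed value list is nominal, while `real` marks a numeric attribute. The Python sketch below illustrates this on a few header lines copied from the German credit file; the function name `split_attributes` is an assumption, and the parser does not handle quoted attribute names that contain spaces.

```python
header = """\
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_amount real
@attribute housing { rent, own, 'for free'}
@attribute age real
@attribute class { good, bad}
"""


def split_attributes(arff_header):
    """Return (nominal_names, numeric_names) from ARFF @attribute lines."""
    nominal, numeric = [], []
    for line in arff_header.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) < 3 or parts[0].lower() != "@attribute":
            continue
        name, spec = parts[1], parts[2]
        if spec.startswith("{"):
            nominal.append(name)      # value list in braces: nominal
        elif spec.lower() in ("real", "numeric", "integer"):
            numeric.append(name)      # numeric type keyword
    return nominal, numeric


nominal, numeric = split_attributes(header)
print("Nominal:", nominal)
print("Numeric:", numeric)
```

Running the same function over the full header of the German credit file would list every attribute under one of the two headings.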


2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
In my view, the following attributes may be crucial in making the credit risk assessment.


Based on the above attributes, we can make a decision whether to give credit or not.


3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.
A decision tree is a flow-chart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can be easily converted into classification rules. Well-known decision tree algorithms include ID3, C4.5 and CART; J48 is WEKA's implementation of C4.5.
J48 pruned tree
The resulting window in WEKA is as follows:


The decision tree above is unclear due to a large number of attributes.


4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?
In the above model we trained on the complete dataset and classified credit as good or bad for each of the examples in the dataset. For example:
IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;
In this way we classified each of the examples in the dataset.


5. Is testing on the training set as you did above a good idea? Why or why not?


6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration the subsets D2, D3, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1. The second model is trained on the subsets D1, D3, ..., Dk and tested on D2, and so on.
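The partitioning described above can be sketched in a few lines of Python. This is an illustration of plain k-fold splitting over instance indices, not WEKA's implementation (WEKA additionally stratifies the folds by class, which this sketch omits); the function name and the seed are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for each of the k iterations."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random partitioning
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive folds
    for i in range(k):
        test = folds[i]                           # Di is the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1000 instances, as in the German credit data, with 10 folds.
splits = list(k_fold_indices(n=1000, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every instance is used for testing exactly once and for training k - 1 times, so the reported accuracy is an average over instances the model did not see during training.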


1. Select the Classify tab, choose the J48 decision tree, and under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds is the number of partitions into which the instances are divided.
3. A Kappa statistic close to 1 indicates near-perfect agreement between the predicted and actual classes; at 100% accuracy all the error measures would be zero. In practice, however, 100% accuracy is rarely achieved.
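The Kappa statistic compares the observed accuracy with the accuracy expected by chance from the row and column totals of the confusion matrix. A minimal Python sketch of that calculation follows; the two matrices are hypothetical and the function name `kappa` is an assumption.

```python
def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)   # row total * column total
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrices, for illustration only.
imperfect = [[40, 10], [20, 30]]
perfect = [[50, 0], [0, 50]]

print(kappa(imperfect))
print(kappa(perfect))
```

For the first matrix the observed accuracy is 0.7 and the chance accuracy is 0.5, so kappa is (0.7 - 0.5) / (1 - 0.5); for the second matrix every instance is on the diagonal and kappa reaches its maximum.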

Cross Validation Result at folds:10 for the table GermanCreditData:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Here there are 1000 instances with 100 instances per partition.

Cross Validation Result at folds:20 for the table GermanCreditData:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Cross Validation Result at folds:50 for the table GermanCreditData:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Cross Validation Result at folds:100 for the table GermanCreditData:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances


Percentage split does not allow 100%; it allows values only up to 99.9%.
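A percentage split trains on the given share of the instances and tests on the rest, so a 100% split would leave nothing to test on. The Python sketch below illustrates the arithmetic for the 1000-instance dataset; the function name, the seed and the rounding rule are assumptions rather than WEKA's exact behaviour.

```python
import random


def percentage_split(n, train_percent, seed=1):
    """Shuffle n instance indices and cut them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100)    # number of training instances
    return indices[:cut], indices[cut:]


train, test = percentage_split(1000, 50)
train_hi, test_hi = percentage_split(1000, 99.9)
print(len(train), len(test))
print(len(train_hi), len(test_hi))
```

At 99.9% only a single instance remains for testing, so the accuracy reported for that split is based on one prediction.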

Percentage Split Result at 50%:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Percentage Split Result at 99.9%:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances


7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal-status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
Removing these attributes increases the accuracy, because the two attributes foreign_worker and personal_status are not very important for training and analysis. Removing them also reduces the training time to some extent. The decision tree built on the full dataset is much larger than the decision tree trained after removing these attributes; this is the main difference between the two trees.

After removing foreign_worker, write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the 9th attribute (personal_status), write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Using cross-validation: after removing the 9th attribute (personal_status), write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Using percentage split: after removing the 9th attribute (personal_status), write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Using cross-validation: after removing foreign_worker, write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the 20th attribute (foreign_worker), the cross-validation result is as above.

Using percentage split: after removing foreign_worker, write the summary in the table below:
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the 20th attribute (foreign_worker), the percentage split result is as above.


8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21, click Invert, and then click Remove to remove the remaining attributes.

Classify using J48, collect the summary, and write it in the table below.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

Select random attributes and then check the accuracy.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the remaining attributes and visualize them.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the above 14 attributes, classify using J48 and cross-validation.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

After removing the above 14 attributes, classify using J48 and percentage split.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances


9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than that of accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

1. Select the Classify tab.
2. Select More options from Test options.
3. Tick Cost-sensitive evaluation and click Set.
4. Set Classes to 2.
5. Click Resize to get the cost matrix.
6. Change the 2nd entry in the 1st row (actual good, predicted bad) to 5.0 and leave the 1st entry in the 2nd row (actual bad, predicted good) at 1.0, as shown below.
0.0 5.0
1.0 0.0
7. The confusion matrix is then generated, and you can see how the instances of the good and bad classes were classified.
8. Check whether the accuracy changes.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances
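With cost-sensitive evaluation, each cell of the confusion matrix is weighted by the matching cell of the cost matrix. The Python sketch below shows that arithmetic for the costs described in the exercise (5 for rejecting a good applicant, and an assumed cost of 1 for accepting a bad one); the confusion counts are hypothetical and the function name `total_cost` is an assumption.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over all (actual, predicted) cells."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Rows are actual classes, columns are predicted classes, in the order (good, bad).
cost_matrix = [[0.0, 5.0],    # actual good predicted bad costs 5
               [1.0, 0.0]]    # actual bad predicted good costs 1 (assumed)
equal_cost = [[0.0, 1.0],
              [1.0, 0.0]]

# Hypothetical confusion matrix, for illustration only.
confusion = [[600, 100],
             [150, 150]]

weighted = total_cost(confusion, cost_matrix)
unweighted = total_cost(confusion, equal_cost)
print(weighted, unweighted)
```

With equal costs the total is simply the number of misclassified instances, so comparing the two totals shows how much the higher cost for case 1 penalizes the same classifier.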


10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
Yes, it is generally a good idea to prefer simple decision trees. A long, complex decision tree tests many attributes, some of them unnecessary, and tends to fit noise in the training data (overfitting): its bias is low but its variance is high, so its accuracy on unseen data can suffer. A simple tree uses fewer attributes; it has somewhat higher bias but lower variance and usually gives more reliable results on new data.
1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select All to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.

4. To generate the decision tree, right-click on the result list and select the Visualize tree option.
5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this window, set the unpruned option to true.
7. Press OK and then Start. The tree becomes more complex if it is not pruned.
8. Visualize the tree again: the tree has become more complex.


11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning:
Reduced-error pruning holds out part of the training data as a pruning set and replaces a subtree with a leaf whenever doing so does not increase the error on that set.
1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this window, set the reducedErrorPruning option to true and leave the unpruned option as false, since an unpruned tree is not pruned at all.
3. Press OK and then Start.
4. Find whether the accuracy has increased or decreased with the reduced error pruning option selected, write the summary in the table below and justify it.
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances


12. (Extra Credit): How can you convert a Decision Tree into "if-then-else rules"? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules. One such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules.
Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (Weather), the single attribute for making the decision is outlook:

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier ranks first, J48 is in 2nd place and PART gets 3rd place.

            J48   PART  OneR
TIME (sec)  0.12  0.14  0.04
RANK        II    III   I

But if you consider the accuracy, the J48 classifier ranks first, PART gets second place and OneR gets last place.

              J48   PART  OneR
ACCURACY (%)  70.5  70.2  66.8
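The idea behind OneR (pick the single attribute whose one-level rule makes the fewest errors) can be sketched in Python as follows. This is an illustration and not WEKA's OneR implementation; the 14 rows are assumed to match the standard weather.nominal dataset distributed with WEKA, and the function name `one_r` is an assumption.

```python
from collections import Counter, defaultdict

attributes = ["outlook", "temperature", "humidity", "windy"]

# Assumed contents of weather.nominal.arff; the last column is the class (play).
rows = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the attribute with the fewest errors."""
    best = None
    for col, name in enumerate(attributes):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1        # class counts per attribute value
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:      # first attribute wins ties
            best = (name, rule, errors)
    return best


name, rule, errors = one_r(rows, attributes)
print(name, rule, f"({len(rows) - errors}/{len(rows)} instances correct)")
```

On this data the sketch selects outlook with the same three value-to-class mappings reported above.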

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to Classify.
4. Start.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances


Here the accuracy is 100%


The tree can be read as a set of if-then-else rules. Write the if-then-else rules derived from the decision tree in the following table.

To obtain the rules:
1. Go to Choose, then click on rules, then select PART.
2. Click on Save and Start.

=== Evaluation on training set ===
=== Summary ===
Using PART Algorithm
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

3. Similarly for the OneR algorithm.
4. Click on Save and Start.

=== Evaluation on training set ===
=== Summary ===
Using OneR Algorithm
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances
