
Introduction

Data: Data are raw facts, usually collected as a result of experience, observation, experiment, or processes within a computer system, or a set of premises. Data are often viewed as the lowest level of abstraction, from which information and knowledge are derived.
Database: A database is a structured collection of records or data stored in a computer system. The structure is achieved by organizing the data according to a database model.
Database Management System: A Database Management System (DBMS) is a computer program that enables users to modify or store information in a database.
Information: Information is the collection of processed data.
Repository: A repository is a central place in which aggregated data is kept and maintained in an organized way, usually in computer storage.
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
Data Mining: Data mining is the extraction of interesting information or patterns from data in large databases.

Data Mining Functionalities:


1. Concept Description (Characterization and Discrimination):
   Characterization: Summarization of the general characteristics or features of a target class.
   Discrimination: Comparison of the general features of target class objects with those of a contrasting class.
2. Association: The process of discovering association rules, that is, items or attribute values that frequently occur together in the data. An example of an association rule:
   buys(X, computer) => buys(X, software) [support = 1%, confidence = 50%]
   Here a support of 1% means that 1% of all transactions contain both a computer and software, and a confidence of 50% means that half of the customers who bought a computer also bought software.
3. Classification: The process of finding models that describe and distinguish classes or concepts, so that the models can be used for future prediction.
4. Cluster Analysis: Grouping data objects without any prior knowledge of class labels is known as clustering.
5. Outlier Analysis: A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers, and the analysis of outlier data is referred to as outlier analysis.
6. Evolution Analysis: Describes and models regularities or trends for objects whose behavior changes over time.
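The support and confidence figures attached to an association rule can be computed directly from a list of transactions. The following is a minimal sketch; the five transactions and the helper names `support` and `confidence` are made up for illustration and do not come from any real data set.

```python
# Hypothetical toy transactions: each set holds the items one customer bought.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule: buys(X, computer) => buys(X, software)
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this toy data, two of the five transactions contain both items and four contain a computer, so the rule has a support of 2/5 and a confidence of 2/4.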

Data Preprocessing:
Data preprocessing is used to deal with dirty, incomplete, noisy, and inconsistent data before mining. Major tasks in data preprocessing:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
5. Data Discretization
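Three of these tasks can be illustrated on a single numeric attribute. The sketch below is only one possible reading of each task: it assumes mean imputation for cleaning, min-max normalization for transformation, and equal-width binning for discretization. The `ages` values, the bin labels, and the helper `equal_width_bin` are hypothetical.

```python
# Hypothetical raw attribute; None marks a missing value.
ages = [23, 35, None, 51, 47, 62]

# Data cleaning: replace missing values with the mean of the observed values.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
cleaned = [a if a is not None else mean_age for a in ages]

# Data transformation: min-max normalization to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(a - lo) / (hi - lo) for a in cleaned]


# Data discretization: equal-width binning into a fixed number of intervals.
def equal_width_bin(value, lo, hi, bins):
    """Return the 0-based index of the equal-width interval holding `value`."""
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return min(int((value - lo) / width), bins - 1)


labels = ["young", "middle", "senior"]
discretized = [labels[equal_width_bin(a, lo, hi, len(labels))] for a in cleaned]

print(cleaned)
print(discretized)
```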

Data Mining Tools


About the data mining tool WEKA: WEKA is an acronym for Waikato Environment for Knowledge Analysis. It is a Java-based machine learning tool.

Features of WEKA:
   49 data preprocessing tools
   76 classification/regression algorithms
   8 clustering algorithms
   15 attribute/subset evaluators + 10 search algorithms for feature selection
   3 algorithms for finding association rules
   3 graphical user interfaces

WEKA software: http://www.cs.waikato.ac.nz/ml/weka/
   Data mining software in Java
   Open source software

WEKA Versions:
There are several versions of WEKA:
   WEKA 3.0: "book version", compatible with the description in the data mining book
   WEKA 3.2: "GUI version", adds graphical user interfaces (the book version is command-line only)
   WEKA 3.3: "development version", with lots of improvements

File Formats WEKA Understands:

WEKA understands the ARFF, CSV, C4.5, and binary file formats. In short, WEKA works with flat files only.

Exploring WEKA:
When you open WEKA, a window like the one below appears. The WEKA GUI offers four options: Simple CLI, Explorer, Experimenter, and KnowledgeFlow.

Experiment 1: Creation of an ARFF file


An ARFF file can be created in two ways:

First way:
1. Create an Excel file like the one below.
2. Save it as filename.xls.
3. Open the same file and save it again as filename.csv, choosing the file type "CSV (Comma delimited)".
4. A file like the one below will be created.
5. Now open filename.csv in MS Word and type the ARFF header (the @relation, @attribute, and @data declarations) in the format shown below.
6. Save the file as filename.arff, choosing the file type "Plain Text".
7. An ARFF file like the one below will be created.
8. If you double-click this ARFF file, it opens directly in the WEKA GUI environment, as shown below.
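The manual CSV-to-ARFF conversion can also be scripted. The following is a minimal sketch, not WEKA's own converter: the function name `csv_to_arff`, the relation name `weather`, and the sample rows are all hypothetical. It assumes that the first CSV row holds the attribute names, that names and values contain no spaces or commas, and that a column is numeric only if every value parses as a number (otherwise it is declared nominal).

```python
import csv
import io


def is_number(text):
    """True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds attribute names."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct values inside braces.
            lines.append("@attribute " + name + " {" + ",".join(sorted(set(column))) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample data, standing in for the Excel sheet.
sample_csv = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

arff_text = csv_to_arff(sample_csv, "weather")
print(arff_text)
```

Writing `arff_text` to a file with the `.arff` extension should give a file that the Explorer can load, provided the assumptions above hold for the data.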

Experiment 2: Association rule mining using APRIORI in WEKA Explorer


The Apriori Algorithm: Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.

Algorithm:
Input: Database D; min_sup, the minimum support threshold.
Output: L, the frequent itemsets in D.
Method:
    L_1 = find_frequent_1-itemsets(D);
    for (k = 2; L_{k-1} ≠ ∅; k++) {
        C_k = apriori_gen(L_{k-1}, min_sup);
        for each transaction t ∈ D {            // scan D to count candidates
            C_t = subset(C_k, t);               // candidates contained in t
            for each candidate c ∈ C_t
                c.count++;
        }
        L_k = {c ∈ C_k | c.count ≥ min_sup}
    }
    return L = ∪_k L_k;

Procedure apriori_gen(L_{k-1}: frequent (k-1)-itemsets; min_sup)
    for each itemset l1 ∈ L_{k-1}
        for each itemset l2 ∈ L_{k-1}
            if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
                c = l1 ⋈ l2;                    // join step: generate a candidate
                if has_infrequent_subset(c, L_{k-1}) then
                    delete c;                   // prune step: remove an unfruitful candidate
                else add c to C_k;
            }
    return C_k;

Procedure has_infrequent_subset(c: candidate k-itemset; L_{k-1}: frequent (k-1)-itemsets)
    for each (k-1)-subset s of c
        if s ∉ L_{k-1} then
            return true;
    return false;
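The level-wise search described above can be sketched in plain Python. This is an illustrative implementation of the pseudocode, not WEKA's code: `min_sup` is assumed to be an absolute support count, itemsets are represented as `frozenset` objects, and the nine transactions with items `I1` to `I5` are a toy data set chosen for illustration.

```python
from itertools import combinations


def has_infrequent_subset(candidate, prev_frequent):
    """True if some (k-1)-subset of the k-itemset `candidate` is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in prev_frequent
               for s in combinations(candidate, k - 1))


def apriori_gen(prev_frequent):
    """Join frequent (k-1)-itemsets into k-candidates, then prune them."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for l1 in prev:
        for l2 in prev:
            # Join step: equal on the first k-2 items, ordered on the last one.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = frozenset(l1) | frozenset(l2)
                # Prune step: every (k-1)-subset of c must be frequent.
                if not has_infrequent_subset(c, prev_frequent):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_sup):
    """Return {frequent itemset: support count}; min_sup is a count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_sup}   # L_1
    frequent = dict(current)
    while current:                                                # L_{k-1} is not empty
        candidates = apriori_gen(set(current))                    # C_k
        counts = {c: 0 for c in candidates}
        for t in transactions:                                    # scan D
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_sup}  # L_k
        frequent.update(current)
    return frequent


# Hypothetical toy database of nine transactions.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

frequent = apriori(D, min_sup=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning in `apriori_gen` relies on the property that every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent (k-1)-subset is discarded before the database is scanned.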

Apriori in WEKA Explorer:
1. Open the WEKA GUI; a window like the one below appears.
2. Click the Explorer button; a window like the one below appears.
3. Click the "Open file" button and select the ARFF file to load into WEKA, as shown below.
4. After loading, all of your data is displayed in the Explorer window, as shown below.
5. Click the Associate tab; a window like the one below appears.
6. Click the Choose button; a list of association rule mining algorithms appears. Select Apriori from the list.
7. You can also set the algorithm's properties, such as the support and confidence thresholds, by right-clicking the selected algorithm.
8. Click the Start button; the algorithm runs and the output (the strong association rules) appears in the output window.
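Strong association rules are those that meet both the minimum support and the minimum confidence thresholds. How such rules can be derived from frequent itemsets is sketched below. This is an assumption-laden illustration and not a description of WEKA's internals: the support counts in `support_counts`, the function name `strong_rules`, and the 70% confidence threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical support counts for frequent itemsets that are assumed to have
# already passed the minimum support threshold.
support_counts = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}


def strong_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for rules with confidence >= min_conf."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset of the itemset as the antecedent.
        for size in range(1, len(itemset)):
            for items in combinations(sorted(itemset), size):
                antecedent = frozenset(items)
                conf = count / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


rules = strong_rules(support_counts, min_conf=0.7)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), f"confidence = {conf:.0%}")
```

The lookup `support_counts[antecedent]` assumes that every subset of a frequent itemset is present in the table, which holds when the table is the complete output of a frequent itemset miner.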
