
Classification assigns data elements to classes based on the differences among elements belonging to different classes. Clustering groups data elements into different groups based on the similarity between elements within a single group.

Definition

Classification and prediction are two forms of data analysis that can be used to

extract models describing important data classes or to predict future data

trends.

Such analysis can help provide us with a better understanding of the data at

large.

Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.

Many classification and prediction methods have been proposed by researchers

in machine learning, pattern recognition, and statistics.

Preparing the Data for Classification and Prediction

The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process:

Data Cleaning

This refers to the preprocessing of data in order to remove or reduce noise and the

treatment of missing values.

Relevance analysis

Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.

Data Transformation and reduction

The data may be transformed by normalization, particularly when neural networks or

methods involving distance measurements are used in the learning steps.

Comparing Classification and Prediction Methods

Classification and prediction methods can be compared and evaluated according to the following criteria:

Accuracy

This refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information).

Speed

This refers to the computational costs involved in generating and using the given

classifier or predictor.

Robustness

This refers to the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

Scalability

This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.

Interpretability

This refers to the level of understanding and insight that is provided by the classifier

or predictor.

Page 1 Unit 8

Classification Techniques

Decision Tree Identification

Classification Problem

Weather → Play (Yes, No)

Given N element types and m decision classes:

1. For i = 1 to N do
   i. Add element i to the (i-1)-element item-sets from the previous iteration
   ii. If an item-set has only one decision class, then that item-set is done; remove that item-set from subsequent iterations
2. done


First iteration (1-element item-sets):

Sunny → Yes (done)
Cloudy → Yes/No
Overcast → Yes/No


Second iteration (2-element item-sets):

Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes
Overcast, Warm → —
Overcast, Chilly → No
Overcast, Pleasant → Yes
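The item-set procedure can be sketched in Python. The training tuples below are hypothetical, chosen so the leaves match the Sunny/Cloudy/Overcast iterations above; `identify` grows item-sets one attribute per iteration and retires an item-set as soon as all matching tuples agree on one decision class.

```python
# Hypothetical training data for the Weather -> Play(Yes, No) problem;
# the source does not list the actual tuples.
data = [
    ({"outlook": "Sunny",    "temp": "Warm"},     "Yes"),
    ({"outlook": "Cloudy",   "temp": "Warm"},     "Yes"),
    ({"outlook": "Cloudy",   "temp": "Chilly"},   "No"),
    ({"outlook": "Cloudy",   "temp": "Pleasant"}, "Yes"),
    ({"outlook": "Overcast", "temp": "Chilly"},   "No"),
    ({"outlook": "Overcast", "temp": "Pleasant"}, "Yes"),
]
attributes = ["outlook", "temp"]

def identify(data, attributes):
    """Grow item-sets one attribute at a time; an item-set becomes a leaf
    as soon as every matching tuple shares a single decision class."""
    leaves = []
    itemsets = [{}]                      # start with the empty item-set
    for attr in attributes:
        extended = []
        for items in itemsets:
            # values of the next attribute among tuples matching this item-set
            values = {row[attr] for row, _ in data
                      if all(row[a] == v for a, v in items.items())}
            for v in sorted(values):
                new = dict(items, **{attr: v})
                classes = {label for row, label in data
                           if all(row[a] == val for a, val in new.items())}
                if len(classes) == 1:    # pure: done, drop from later iterations
                    leaves.append((new, classes.pop()))
                else:
                    extended.append(new)
        itemsets = extended
    return leaves

for items, label in identify(data, attributes):
    print(items, "->", label)
```

Note that, as the text observes, the leaves produced depend on the order in which the attributes are considered.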


Decision Tree Identification Example:

Top-down technique for decision tree identification.

The decision tree created is sensitive to the order in which items are considered.

If an N-item-set does not result in a clear decision, the classification classes have to be modeled by rough sets.

Clustering Techniques

Clustering partitions the data set into clusters or equivalence classes.

Similarity among members of a class is greater than similarity among members across classes.

Similarity measures: Euclidean distance or other application-specific measures.

Clustering Techniques

Nearest Neighbour Clustering Algorithm:

Given n elements x1, x2, …, xn, and threshold t:

1. j ← 1, k ← 1, cluster1 = { }
2. Repeat
   I. Find the nearest neighbour of xj among the already-clustered elements
   II. Let the nearest neighbour be in cluster m
   III. If the distance to the nearest neighbour > t, then k ← k + 1 and create a new cluster; else assign xj to cluster m
   IV. j ← j + 1
3. Until j > n
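A sketch of the threshold-based procedure above in Python (the sample points, threshold, and cluster numbering are illustrative, not from the source):

```python
import math

def nearest_neighbour_clustering(points, t):
    """Assign each point to the cluster of its nearest already-seen
    neighbour, unless that neighbour is farther than threshold t,
    in which case a new cluster is started."""
    clusters = {1: [points[0]]}          # the first element starts cluster 1
    k = 1
    for x in points[1:]:
        # nearest neighbour among elements already assigned to a cluster
        best_m, best_d = None, float("inf")
        for m, members in clusters.items():
            for y in members:
                d = math.dist(x, y)      # Euclidean distance
                if d < best_d:
                    best_m, best_d = m, d
        if best_d > t:                   # too far: open a new cluster
            k += 1
            clusters[k] = [x]
        else:
            clusters[best_m].append(x)
    return clusters

clusters = nearest_neighbour_clustering(
    [(0, 0), (0, 1), (10, 10), (10, 11)], t=2.0)
print(clusters)   # two clusters: the (0, *) points and the (10, *) points
```

As the pseudocode suggests, the result depends on the order in which the elements are presented and on the choice of t.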

K-Means Clustering Algorithm:

Given n elements x1, x2, …, xn, and k clusters, each with a center:

1. Assign each element to its closest cluster center

2. After all assignments have been made, compute the cluster centroids for each of the clusters

3. Repeat the above two steps with the new centroids until the algorithm converges

Regression

Numeric prediction is the task of predicting continuous (or ordered) values for given

input

For example:

We may wish to predict the salary of college graduates with 10 years of work

experience, or the potential sales of a new product given its price.

The most widely used approach for numeric prediction is regression, a statistical methodology that was developed by Sir Francis Galton (1822–1911), a statistician who was also a cousin of Charles Darwin. Many texts use the terms "regression" and "numeric prediction" synonymously.

Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued).

In the context of data mining, the predictor variables are the attributes of interest

describing the tuple

The response variable is what we want to predict


Types of Regression

The types of regression are:

Linear Regression

Nonlinear Regression

Linear Regression

Straight-line regression analysis involves a response variable, y, and a single

predictor variable, x.

It is the simplest form of regression, and models y as a linear function of x.

That is,

y = b + wx

where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line, respectively.

The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write

y = w0 + w1x.

The regression coefficients can be estimated by the method of least squares, which minimizes the error between the actual data and the estimate of the line, using the following equations:

w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

w0 = ȳ − w1 x̄

where x̄ is the mean of the xi values and ȳ is the mean of the yi values.
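As a sketch, the least-squares estimates of w0 and w1 can be computed directly (the salary figures below are hypothetical, echoing the salary-prediction example earlier in the text):

```python
def least_squares_line(xs, ys):
    """Estimate the intercept w0 and slope w1 by least squares:
    w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
    w0 = y_bar - w1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data: years of work experience vs. salary (in $1000s).
years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = least_squares_line(years, salary)
print(f"y = {w0:.2f} + {w1:.2f} x")   # y = 23.21 + 3.54 x
```

With this fitted line, a graduate with 10 years of experience would be predicted a salary of roughly 23.21 + 3.54 × 10 ≈ 58.6 (thousand).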

Multiple Linear Regression

Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable.

An example of a multiple linear regression model based on two predictor attributes or variables, A1 and A2, is

y = w0 + w1x1 + w2x2

where x1 and x2 are the values of attributes A1 and A2, respectively.

Multiple regression problems are commonly solved with the use of statistical software packages, such as SPSS (Statistical Package for the Social Sciences).
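Such a model can also be fitted by ordinary least squares with NumPy; the data below are synthetic, generated from known weights so the recovered coefficients can be checked (an illustrative sketch, not a method named in the source):

```python
import numpy as np

# Synthetic data: two predictor variables (columns of X) and a response y
# generated from y = 1 + 2*x1 + 3*x2, so the true weights are known.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the intercept w0 is estimated as well.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # approximately [1. 2. 3.], i.e. w0=1, w1=2, w2=3
```

Statistical packages such as SPSS solve the same normal equations internally, with additional diagnostics.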

Nonlinear Regression

In straight-line linear regression, the dependent response variable, y, is modeled as a linear function of a single independent predictor variable, x.

What if we could get a more accurate model using a nonlinear model, such as a parabola or some other higher-order polynomial?

Polynomial regression is often of interest when there is just one predictor variable. Consider a cubic polynomial relationship given by

y = w0 + w1x + w2x² + w3x³
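Because this cubic model is still linear in the weights w0…w3, it can be converted to a linear regression problem by treating x, x², and x³ as separate variables. A sketch with NumPy (the data are synthetic and noise-free, drawn from known weights):

```python
import numpy as np

# Synthetic data drawn from y = 1 - 2x + 0.5x^3 (no noise, for clarity).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 - 2 * x + 0.5 * x ** 3

# polyfit solves the equivalent linear least-squares problem;
# it returns coefficients highest degree first: [w3, w2, w1, w0].
w = np.polyfit(x, y, deg=3)
print(w)   # approximately [0.5, 0.0, -2.0, 1.0]
```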

Nonlinear Regression

In statistics, nonlinear regression is a form of regression analysis in which

observational data are modeled by a function which is a nonlinear combination of the

model parameters and depends on one or more independent variables. The data are

fitted by a method of successive approximations.

Clustering


Definition

The process of grouping a set of physical or abstract objects into classes of similar

objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the same

cluster and are dissimilar to the objects in other clusters. A cluster of data objects can

be treated collectively as one group and so may be considered as a form of data

compression.

First the set is partitioned into groups based on data similarity (e.g., using

clustering), and then labels are assigned to the relatively small number of groups.

It is also called unsupervised learning. Unlike classification, clustering and

unsupervised learning do not rely on predefined classes and class-labeled training

examples. For this reason, clustering is a form of learning by observation, rather than

learning by examples.

Definition

Clustering is also called data segmentation in some applications because clustering

partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection, where outliers (values that are “far

away” from any cluster) may be more interesting than common cases.

Advantages

Advantages of such a clustering-based process:

Adaptable to changes

Helps single out useful features that distinguish different groups.

Applications of Clustering

Market research

Pattern recognition

Data analysis

Image processing

Biology

Geography

Automobile insurance

Outlier detection

K-Means Algorithm

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most similar, based

on the mean value of the objects in the cluster;

(4) update the cluster means, i.e., calculate the mean value of the objects for each

cluster;

(5) until no change;
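Steps (1)–(5) can be sketched in Python as follows; the sample points and the fixed seed for choosing initial centers are illustrative choices, not from the source:

```python
import math
import random

def k_means(points, k, seed=0):
    """K-means: pick k initial centers, assign each point to its nearest
    center, recompute the means, and repeat until nothing changes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step (1): arbitrary initial centers
    assignment = None
    while True:                              # steps (2)-(5)
        # step (3): (re)assign each point to the most similar (nearest) center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:     # step (5): no change, converged
            return centers, assignment
        assignment = new_assignment
        for c in range(k):                   # step (4): update the cluster means
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, labels = k_means(pts, k=2)
print(centers)
```

The loop terminates because each pass can only lower the total within-cluster distance, and there are finitely many possible assignments.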


K-Medoids Algorithm

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) arbitrarily choose k objects in D as the initial representative objects or seeds;

(2) repeat

(3) assign each remaining object to the cluster with the nearest representative

object;

(4) randomly select a nonrepresentative object, o_random;

(5) compute the total cost, S, of swapping representative object, o_j, with o_random;

(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;

(7) until no change;
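The swap-based method above can be sketched greedily in Python. This is a simplified PAM-style version: the cost is the sum of distances from each point to its nearest representative, and a swap is kept whenever it lowers that cost (the sample points are hypothetical):

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    """Repeatedly try swapping a representative object o_j with a
    nonrepresentative o_random; keep the swap if the cost change S < 0.
    Stop when no swap improves the cost (step 7: until no change)."""
    medoids = list(points[:k])               # step (1): arbitrary seeds
    improved = True
    while improved:
        improved = False
        for o_random in points:              # step (4)
            if o_random in medoids:
                continue
            for i in range(k):               # steps (5)-(6)
                candidate = medoids[:i] + [o_random] + medoids[i + 1:]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
                    break
    return medoids

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(k_medoids(pts, k=2))   # one medoid from each of the two groups
```

Because medoids are actual data objects rather than means, this method is less sensitive to outliers than k-means.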

Bayesian Classification

Bayesian classification is based on Bayes' theorem.

Studies comparing classification algorithms have found a simple Bayesian classifier

known as the naïve Bayesian classifier to be comparable in performance with decision

tree and selected neural network classifiers.

Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given

class is independent of the values of the other attributes. This assumption is called

class conditional independence.

Bayesian belief networks are graphical models, which unlike naïve Bayesian

classifiers, allow the representation of dependencies among subsets of attributes.

Bayes' Theorem

Let X be a data tuple and H a hypothesis, such as that X belongs to a specified class C. Bayes' theorem relates the posterior probability P(H|X) to the prior probability P(H) and the likelihood P(X|H):

P(H|X) = P(X|H) P(H) / P(X)
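A minimal naïve Bayesian classifier can be sketched as follows. Under class-conditional independence the likelihood factorizes, P(X|C) = Π P(x_k|C), and since P(X) is the same for every class it can be dropped when comparing posteriors. The weather tuples are hypothetical; real implementations usually add a Laplace correction so a single unseen value does not zero out the whole product:

```python
from collections import Counter

# Hypothetical (outlook, temperature) -> Play training tuples.
train = [
    (("Sunny",    "Warm"),     "Yes"),
    (("Cloudy",   "Warm"),     "Yes"),
    (("Cloudy",   "Chilly"),   "No"),
    (("Overcast", "Chilly"),   "No"),
    (("Overcast", "Pleasant"), "Yes"),
]

def naive_bayes(train, x):
    """Return the class maximizing P(C) * prod_k P(x_k | C),
    with probabilities estimated by simple counting."""
    classes = Counter(label for _, label in train)
    n = len(train)
    best, best_p = None, -1.0
    for c, count in classes.items():
        p = count / n                        # prior P(C)
        for k, value in enumerate(x):        # likelihoods P(x_k | C)
            match = sum(1 for row, label in train
                        if label == c and row[k] == value)
            p *= match / count
        if p > best_p:
            best, best_p = c, p
    return best

print(naive_bayes(train, ("Cloudy", "Warm")))   # -> Yes
```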

