
Classification

Unit-III
• It is a data analysis task, i.e. the process of finding a model that
describes and distinguishes data classes and concepts.
• Classification is the problem of identifying which of a set of
categories (sub-populations) a new observation belongs to, on the
basis of a training set of data containing observations whose
category membership is known.
• Example: Before starting any project, we need to check its feasibility.
• In this case, a classifier is required to predict class labels such as 'Safe'
and 'Risky', so that the project can be adopted and approved.
• There are two forms of data analysis that can be used for extracting models
describing important classes or to predict future data trends.
• These two forms are as follows −
oClassification
oPrediction
 Classification models predict categorical class labels
 and prediction models predict continuous valued functions.
 For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment
given their income and occupation.
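• A minimal sketch of this contrast (not from the text: the loan and expenditure figures below are invented, and scikit-learn is assumed) showing a classification model predicting a categorical label and a prediction (regression) model predicting a numeric value:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: [income in $1000s, existing debt in $1000s] -> safe/risky
loan_X = [[80, 5], [30, 40], [60, 10], [25, 35]]
loan_y = ["safe", "risky", "safe", "risky"]
print(DecisionTreeClassifier().fit(loan_X, loan_y).predict([[70, 8]]))

# Prediction (regression): [income in $1000s] -> expenditure in dollars
cust_X = [[30], [50], [70], [90]]
cust_y = [500, 900, 1300, 1700]
print(LinearRegression().fit(cust_X, cust_y).predict([[60]]))  # ~1100.0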
What is classification?
• Following are the examples of cases where the data analysis task is
Classification −
• A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to predict whether a customer
with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to
predict the categorical labels.
• These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
• Following are the examples of cases where the data analysis task is
Prediction −
• Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company.
• In this example we need to predict a numeric value.
• Therefore the data analysis task is an example of numeric prediction.
Classification contd…
• Classification is a data mining function that assigns items in a
collection to target categories or classes.
• The goal of classification is to accurately predict the target class for
each case in the data.
• For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
• A classification task begins with a data set in which the class
assignments are known.
• For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a
period of time.
• In addition to the historical credit rating, the data might track
employment history, home ownership or rental, years of residence,
number and type of investments, and so on.
• Credit rating would be the target, the other attributes would be the
predictors, and the data for each customer would constitute a case.
• The simplest type of classification problem is binary classification:
• In binary classification, the target attribute has only two possible
values: for example, high credit rating or low credit rating.
• Multi-class targets have more than two values: for example, low,
medium, high, or unknown credit rating.
• CLASSIFICATION is a classic data mining technique based on machine
learning.
• Basically, classification is used to classify each item in a set of data
into one of a predefined set of classes or groups.
• Classification methods make use of mathematical techniques such as
decision trees, linear programming, neural networks, and statistics.
• In classification, we develop software that learns how to classify
data items into groups.
• For example, given all records of employees who left the company, we can
apply classification to predict who will probably leave the company in a
future period.
• In this case, we divide the records of employees into two groups
named "leave" and "stay".
• We can then ask our data mining software to classify the employees
into these separate groups.
• A simple classification method is naive Bayesian classification, which
uses conditional probability.
oIt is commonly used for recognizing spam emails and automatically
moving them to the spam folder.
• K-nearest-neighbour is another algorithm, used to classify new
entries based on their features.
• Clustering algorithms can also be used to classify documents based
on their content. One of the most famous is K-means.
• LDA(Linear discriminant analysis) method is also a way for classifying
text from a semantic point of view.
• Fitness wristbands such as Fitbit's are trained to recognize user
activities such as walking, cycling, running, or sleeping.
• This way they can measure and report activities with great precision.
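• As a minimal illustration of such classifiers (the fruit measurements and labels below are invented, and scikit-learn is assumed), k-nearest-neighbour can classify a new fruit from its features:

from sklearn.neighbors import KNeighborsClassifier

# Features: [weight in grams, diameter in cm]; labels: fruit type
X = [[150, 7.0], [170, 7.5], [140, 6.8], [300, 9.5], [320, 10.0]]
y = ["apple", "apple", "apple", "grapefruit", "grapefruit"]

knn = KNeighborsClassifier(n_neighbors=3)  # vote among the 3 nearest entries
knn.fit(X, y)
print(knn.predict([[160, 7.2]]))  # -> ['apple']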
• How Does Classification Work?
• With the help of the bank loan application that we have discussed, let
us understand the working of classification. The Data Classification
process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples
and their associated class labels.
• Each tuple that constitutes the training set has an associated class
label. These tuples can also be referred to as samples, objects, or data
points.
• Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the
model learn from the available training set. The model has to be trained
so that it predicts accurate results.
• Classification Step: The model is used to predict class labels; testing
the constructed model on test data lets us estimate the accuracy of
the classification rules.
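• A minimal sketch of these two steps (scikit-learn and its bundled iris data set are assumed, standing in for the bank-loan data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()          # learning step: build the classifier
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)            # classification step: predict labels
print(accuracy_score(y_test, y_pred))   # estimate accuracy on test data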
Training and Testing:
• Suppose there is a person sitting under a fan and the fan starts
falling on him; he should move aside in order not to get hurt.
• Learning to move away is his training part.
• During testing, if the person sees any heavy object coming towards him
or falling on him and moves aside, then the system is tested positively;
if the person does not move aside, then the system is tested negatively.
• The same is the case with data: it should be trained in order to get
accurate and reliable results.
Syntax…
• Mathematical notation: Classification is based on building a function
that takes an input feature vector X and predicts its outcome Y.
• Here a classifier (or model) is used, which is a supervised function; it
can be designed manually based on an expert's knowledge. It is
constructed to predict class labels (example: label "Yes" or "No" for
the approval of some event).
• Classifiers can be categorized on two major types:
1. Discriminative: It is a very basic type of classifier and determines
just one class for each row of data. It models the decision boundary
directly from the observed data, and so depends heavily on the quality
of the data rather than on class distributions.
Example: Logistic Regression
Acceptance of a student at a university (test scores and grades need
to be considered).
Suppose there are a few students and their results are as follows
(see the sketch below):
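• (The slide's table of student results is not reproduced here, so the test scores, grades, and admit labels below are invented for illustration; scikit-learn is assumed.)

from sklearn.linear_model import LogisticRegression

# Features: [test score, grade point average]; label: 1 = accepted, 0 = rejected
X = [[85, 8.5], [60, 6.0], [75, 7.8], [50, 5.5], [90, 9.1], [65, 6.4]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict([[80, 8.0]]))        # predicted class label
print(model.predict_proba([[80, 8.0]]))  # class probabilities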
2. Generative: It models the distribution of individual classes and tries to
learn the model that generates the data behind the scenes by estimating
assumptions and distributions of the model.
• Used to predict the unseen data.
• Example: Naive Bayes Classifier: Detecting Spam emails by looking at the
previous data.
• Suppose there are 100 emails, divided 1:3, i.e. Class A: 25% (spam
emails) and Class B: 75% (non-spam emails).
• Now suppose a user wants to check whether an email containing the word
"cheap" should be termed spam.
So, if the email contains the word "cheap", what is the probability of it
being spam? (= 80%)
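• A worked sketch of the Bayes computation behind the 80% answer; the slide does not give the word counts, so assume, for illustration, that 20 of the 25 spam emails and 5 of the 75 non-spam emails contain the word "cheap":

p_spam = 0.25                 # P(spam)
p_cheap_given_spam = 20 / 25  # P("cheap" | spam), assumed for illustration
p_cheap_given_ham = 5 / 75    # P("cheap" | non-spam), assumed for illustration

# Total probability of seeing "cheap", then Bayes' theorem
p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_ham * (1 - p_spam)
p_spam_given_cheap = p_cheap_given_spam * p_spam / p_cheap
print(p_spam_given_cheap)     # 0.8, i.e. 80%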
• Classifiers Of Machine Learning:
oDecision Trees
oBayesian Classifiers
oNeural Networks
oK-Nearest Neighbour
oSupport Vector Machines
oLinear Regression
oLogistic Regression
• Associated Tools and Languages: Used to mine/ extract useful
information from raw data.
• Main Languages used: R, SAS, Python, SQL
• Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
• Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK,
TensorFlow, Seaborn, Basemap, etc.
Classification and Prediction Issues…
• The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating
missing values. Noise is removed by applying smoothing techniques, and
the problem of missing values is solved by replacing a missing value
with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also have irrelevant attributes.
Correlation analysis is used to determine whether any two given
attributes are related.
• Data Transformation and reduction − The data can be transformed by
any of the following methods.
• Normalization − The data is transformed using normalization. Normalization
involves scaling all values of a given attribute so that they fall within a
small specified range. Normalization is used when, in the learning step,
neural networks or methods involving distance measurements are used.
• Generalization − The data can also be transformed by generalizing it to a
higher-level concept. For this purpose we can use concept hierarchies.
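• A minimal sketch of normalization, assuming the common min-max method (the text does not name a specific formula): the values of an attribute are scaled into the range [0, 1].

values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]  # (v - min) / (max - min)
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]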
Comparison of Classification and Prediction
Methods…
• Here is the criteria for comparing the methods of Classification and Prediction −
• Accuracy − The accuracy of a classifier refers to its ability to predict
the class label correctly; the accuracy of a predictor refers to how well a
given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier
or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or
predictor can be understood, i.e. the level of insight it provides.
• Whenever we are provided with a data set we will divide it into two parts namely:
o Training set
o Test set
• Based on the training set we will build the entire system and algorithm
(in the case of data mining).
• Once the algorithm is built, we need to test its accuracy; for this
purpose we have the test set.
• Testing: We provide input from the test set to the algorithm and obtain a
class label for the provided data; we then tally it with the already
existing class label of the respective data.
• After testing the algorithm, our main motive in building it is to predict
the class labels of records whose class labels are unknown.
• This process is called CLASSIFICATION.
Decision tree…
• A decision tree is a graphical representation of all the possible solutions to
a decision based on certain conditions.
• It's called a decision tree because it starts with a single box (or root), which
then branches off into a number of solutions, just like a tree.
• Example: What is Decision Tree?
• When you call a large company sometimes you end up talking to their
“intelligent computerized assistant,” which asks you to press 1 then 6, then
7, then entering your account number, 3, 2 and then you are redirected to
a harried human being.
• You may think that you were caught in voicemail hell, but the company you
called was just using a decision tree to get you to the right person.
• Decision trees are helpful, not only because they are a visual
representation that helps you 'see' what you are thinking, but also
because making a decision tree requires a systematic, documented
thought process.
• Often, the biggest limitation of our decision making is that we can
only select from the known alternatives.
• Decision trees help formalize the brainstorming process so we can
identify more potential solutions.
• Applied in real life, decision trees can be very complex and end up
including pages of options.
• But, regardless of the complexity, decision trees are all based on the
same principles. Now where will you use it?
• You can use this for fraud detection or to check whether the
transaction is genuine or not.
• Suppose I am using a credit card here in India, now due to some
reason I had to fly to Dubai, now if I am using the credit card over
there, I will get a notification or alert regarding my transaction.
• They would ask me to confirm about the transaction.
• So this is also a kind of predictive analysis: the machine predicts that
something fishy is going on in the transaction and generates a call for
confirmation, because it differs so much from my transaction history.
• You can even use it to classify different items such as fruits on the
basis of their taste, colour, size, or weight.
• A machine well trained using a classification algorithm can easily
predict the class or type of a fruit whenever new data is given to it.
• Decision Tree (DT) is a supervised learning method used for classification and regression.
• It is a tree that assists us in decision-making!
• Decision tree builds classification or regression models in the form of a tree structure.
• It breaks a data set down into smaller and smaller subsets while, at the
same time, the decision tree is incrementally developed.
• The final tree is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches; a leaf node represents a
classification or decision.
• We cannot split further on leaf nodes.
• The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node.
• Decision trees can handle both categorical and numerical data.
Key Factors:
• Entropy − It is the measure of randomness or 'impurity' in the dataset.
If the sample is completely homogeneous the entropy is zero, and if the
sample is equally divided it has an entropy of one.
Information Gain…
• It is the measure of decrease in entropy after the dataset is split.
• It is also known as Entropy Reduction.
• Constructing a decision tree is all about finding the attribute that
returns the highest information gain (i.e., the most homogeneous
branches). A sketch of both quantities follows.
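• A minimal sketch of entropy and information gain in plain Python (the toy weather data below is invented):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels: -sum(p_i * log2(p_i))
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Decrease in entropy after splitting on the attribute at attr_index
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))                    # 1.0: an equally divided sample
print(information_gain(rows, labels, 0))  # 1.0: the split yields pure branches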
Why are decision trees useful?
• Decision trees provide a great method of decision-making because they:
• Clearly lay out the problem so that all options can be considered.
• Allow us to analyze the possible scenarios of a decision.
• Provide a framework to quantify the values of outcomes and the
probabilities of achieving them.
• Advantages of using Decision Trees:
• Implicitly performs feature selection
• Require relatively little effort from users for data preparation
• Decision trees do not require any assumptions of linearity in the data.
Thus, we can use them in scenarios where we know the parameters are
nonlinearly related
Pros…
• Easy to build and easy to interpret. Can build simple human
interpretable business rules.
• More robust to missing data and outliers
• Can handle both numerical and categorical dependent and
independent variables.
• The role and importance of each variable can be easily assessed (white
box).
• Non-parametric: doesn't assume any particular relationship between the
independent and dependent variables.
Cons…
• May over-fit and hence become unstable over time or on new data.
• Changes in the population and variables distribution may cause wild
swings in the model performance.
• May favor variables which have more levels and more categories.
• Generally less preferred than other, more robust statistical techniques
when high precision and recall are desired.
• To build a decision tree we need to calculate two types of entropy using
frequency tables, as given below:
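oEntropy using the frequency table of one attribute (the class itself); in
its standard form, E(S) = Σ −p_i · log2(p_i), where p_i is the proportion
of records in class i.
oEntropy using the frequency table of two attributes (the class against a
candidate split attribute X); in its standard form, E(T, X) = Σ_c P(c) · E(c),
summed over the values c of X.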
Applications of Decision Trees…
• Following are the common areas for applying decision trees:
• Direct Marketing – While marketing products and services, a business
should track the products and services offered by competitors, as this
identifies the best combination of products and marketing channels that
target specific sets of consumers.
• Customer Retention – Decision trees help organizations keep their valuable
customers and win new ones by providing good-quality products, discounts,
and gift vouchers. They can also analyze the buying behaviour of customers
and gauge their satisfaction levels.
• Fraud Detection – Fraud is a major problem for many industries. Using a
classification tree, a business can detect fraud beforehand and drop
fraudulent customers.
• Diagnosis of Medical Problems – Classification trees identify patients who
are at risk of suffering from serious diseases such as cancer and diabetes.
Split algorithm based on gini index…
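• The slide's worked example is not reproduced here; as a minimal sketch, the Gini index of a set of labels is 1 − Σ p_i², and a candidate split is scored by the weighted Gini of its branches:

from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(groups):
    # Weighted average Gini over the branches of a split
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

print(gini(["yes", "yes", "no", "no"]))               # 0.5: maximum for 2 classes
print(gini_of_split([["yes", "yes"], ["no", "no"]]))  # 0.0: a pure split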
Naïve bayes method …
• It is based on the work of Thomas Bayes. Bayes was a British minister,
and his theory was published only after his death.
• It is a mystery what Bayes wanted to do with such calculations.
• It is quite different from the decision tree approach.
• In Bayesian classification we have a hypothesis that the given data
belongs to a particular class.
• We then calculate the probability for the hypothesis to be true.
• The approach requires only one scan of the whole data.
• If at some stage there is additional training data, each training
example can incrementally increase or decrease the probability that a
hypothesis is correct.
• P(A) refers to the probability that event A will occur.
• P(A|B) stands for the probability that event A will happen, given that
event B has already happened; it is the conditional probability.
• Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)
• If we consider X to be an object to be classified, then Bayes' theorem
may be read as giving the probability that X belongs to one of the classes
C1, C2, C3, etc., by calculating P(Ci|X) = P(X|Ci) · P(Ci) / P(X).
• Once these probabilities have been computed for all the classes, we
simply assign X to the class that has the highest conditional probability.
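• A minimal sketch of this decision rule (the priors and likelihoods below are invented): since P(X) is the same for every class, it can be ignored when picking the class maximizing P(X|Ci) · P(Ci).

priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}          # P(Ci), assumed values
likelihoods = {"C1": 0.10, "C2": 0.40, "C3": 0.05}  # P(X|Ci), assumed values

# P(Ci|X) is proportional to P(X|Ci) * P(Ci)
posteriors = {c: likelihoods[c] * priors[c] for c in priors}
print(max(posteriors, key=posteriors.get))  # -> 'C2'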
Estimating predictive accuracy of classification methods:
• The accuracy of a classification method is the ability of the method to
correctly determine the class of a randomly selected data instance.
• It may be expressed as the probability of correctly classifying unseen
data.
• The accuracy estimation problem is much easier when much more
data is available than is required for training the model.
• We would obviously like to obtain an accuracy estimate (for example, a
mean value and variance) that has low bias and low variance.
• Accuracy may be measured using a number of metrics.
• These include sensitivity, specificity, precision, and accuracy.
• The methods for estimating errors include holdout, random sub-
sampling, K-fold cross validation and leave-one-out.
• Let us assume that the test data has a total of T objects.
• When testing a method we find that C of the T objects are correctly
classified.
• The error rate may then be written as:
• error rate = (T − C)/T (and the accuracy as C/T)
Different methods of estimating accuracy…
1. Confusion matrix: It not only tells us how many objects were classified
correctly, but also which misclassifications occurred (a sketch follows
this list).
2. Holdout method
3. Random sub-sampling method
4. K-fold cross validation method
5. Leave-one-out method
6. Bootstrap method
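• A minimal sketch of the confusion matrix and the metrics derived from it (the labels are invented; scikit-learn is assumed):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))   # true positive rate
print("specificity:", tn / (tn + fp))   # true negative rate
print("precision:  ", tp / (tp + fp))   # fraction of predicted positives correct
print("accuracy:   ", (tp + tn) / (tp + tn + fn + fp))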
Holdout method…
• Also called the test sample method.
• It requires a training set and a test set, and the two sets are mutually
exclusive (disjoint).
• It may be that only one dataset is available, which is then divided
into two subsets: the training set and the test (holdout) subset.
• Once the classification method produces the model using the training
set, the test set can be used to estimate the accuracy.
• The larger the training set, the better the accuracy of the model.
• It is important that the test data is not used in any way to create the
classifier!
• One random split is sufficient for really large data sets.
• For medium-sized data sets → repeated holdout.
• Holdout estimate can be made more reliable by repeating the process with
different subsamples.
• In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)
• The error rates (classification accuracies) on the different iterations are
averaged to yield an overall error rate
• Calculate also a standard deviation!
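• A minimal sketch of repeated holdout (scikit-learn's iris data as a stand-in): the split is repeated with different subsamples, the error rates are averaged, and a standard deviation is reported alongside.

import statistics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []
for seed in range(10):
    # A different stratified random split in each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - clf.score(X_te, y_te))

print("mean error:", statistics.mean(errors))
print("std dev:   ", statistics.stdev(errors))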
• Still not optimal: the different test sets usually overlap (a difficulty
from the statistical point of view).
• Can we prevent overlapping?
Random sub-sampling method…
• It is very much like the holdout method except that it does not rely on a
single test set.
• The holdout estimation is repeated several times and the accuracy
estimate is obtained by computing the mean of the several trials.
• Cross-validation avoids overlapping test sets.
• First step: data is split into k subsets of equal size.
• Second step: each subset in turn is used for testing and the remainder for
training
• This is called k-fold cross-validation.
• Often the subsets are stratified before the cross-validation is performed.
• The error estimates are averaged to yield an overall error estimate
K-fold cross validation method…
• In this the available data is randomly divided into k disjoint subsets of
approximately equal size.
• One of the subsets is then used as the test set and the remaining k−1
subsets are used for building the classifier.
• The test set is then used to estimate the accuracy.
• This is repeated k times so that each subset is used as a test subset
once.
• The accuracy estimate is then the mean of the estimates for each of
the classifiers.
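• A minimal sketch of k-fold cross-validation (iris data as a stand-in), with k = 10:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each of the 10 disjoint subsets serves as the test set exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:    ", scores.mean())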
Leave-one-out method…
• It is a simpler version of k-fold cross validation.
• In this method one training sample is held out and the model is
generated using the remaining training data.
• Once the model is built, the one remaining sample is used for testing,
and the result is coded as 1 or 0 depending on whether it was classified
correctly or not.
• The average of such results provides an estimate of the accuracy.
• This method is useful when the dataset is small.
• For large datasets it becomes expensive.
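• A minimal sketch of leave-one-out (iris data as a stand-in): k-fold cross-validation with k equal to the number of samples, each prediction scored 1 or 0.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())   # one score (1 or 0) per sample
print("LOO accuracy estimate:", scores.mean())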
Bootstrap method…
• In this method, given a dataset of size n, a bootstrap sample is
randomly selected uniformly with replacement(i.e. a sample may be
selected more than once) by sampling n times and used to build a
model.
• It can be shown that, on average, a bootstrap sample contains about
63.2% of the distinct objects in the original data set.
• The error of the model is estimated using the remaining ~36.8% of
objects that do not appear in the bootstrap sample.
• The final error is then computed as 0.632 times the error on these
held-out objects plus 0.368 times the training error.
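• A minimal sketch of one bootstrap round (iris data as a stand-in; plain-Python resampling):

import random
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
random.seed(0)
boot = [random.randrange(n) for _ in range(n)]     # sample n times with replacement
oob = [i for i in range(n) if i not in set(boot)]  # ~36.8% left out on average

clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
err_train = 1 - clf.score(X[boot], y[boot])
err_test = 1 - clf.score(X[oob], y[oob])
print(".632 bootstrap error:", 0.632 * err_test + 0.368 * err_train)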
Other evaluation criteria…
• Speed
• Robustness
• Scalability
• Interpretability
• Goodness of the model
• Flexibility
• Time complexity
Classification software…
• C4.5
• Tree pruning
• CART
• DTREG
• SMILES