
LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.06
Aim: Implementation of Naïve Bayes Classifier
Prerequisites:
C/C++/Java Programming
Learning Outcomes:
Concepts of Bayesian theorem and Classification
Theory:
Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as


posterior = likelihood x prior / evidence
The classifier predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
Practical difficulty: it requires initial knowledge of many probabilities and carries significant computational cost.
Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification derives the maximum a posteriori hypothesis, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only

P(Ci|X) ∝ P(X|Ci) P(Ci)

needs to be maximized.
Example
Class:
C1:buys_computer = yes
C2:buys_computer = no
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data:
age     | income | student | credit_rating | buys_computer
--------+--------+---------+---------------+--------------
<=30    | high   | no      | fair          | no
<=30    | high   | no      | excellent     | no
31...40 | high   | no      | fair          | yes
>40     | medium | no      | fair          | yes
>40     | low    | yes     | fair          | yes
>40     | low    | yes     | excellent     | no
31...40 | low    | yes     | excellent     | yes
<=30    | medium | no      | fair          | no
<=30    | low    | yes     | fair          | yes
>40     | medium | yes     | fair          | yes
<=30    | medium | yes     | excellent     | yes
31...40 | medium | no      | excellent     | yes
31...40 | high   | yes     | fair          | yes
>40     | medium | no      | excellent     | no
Solution:
P(Ci):

P(buys_computer = yes) = 9/14 = 0.643


P(buys_computer = no) = 5/14 = 0.357

Compute P(X|Ci) for each class


P(age = <=30 | buys_computer = yes) = 2/9 = 0.222
P(age = <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = yes) * P(buys_computer = yes) = 0.028
P(X|buys_computer = no) * P(buys_computer = no) = 0.007
Therefore, X belongs to class (buys_computer = yes)
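
To make the computation concrete, the following minimal, self-contained Java sketch reproduces the worked example above. The probabilities are hard-coded from the solution; the class and variable names (BuysComputerExample, likYes, likNo) are illustrative, not part of the lab's required code.

// Reproduces the buys_computer worked example with hard-coded probabilities.
public class BuysComputerExample {
    public static void main(String[] args) {
        // Priors P(Ci) from the 14 training tuples
        double priorYes = 9.0 / 14;   // P(buys_computer = yes)
        double priorNo  = 5.0 / 14;   // P(buys_computer = no)

        // Conditional probabilities P(xk | Ci) for
        // X = (age <= 30, income = medium, student = yes, credit_rating = fair)
        double[] likYes = {2.0 / 9, 4.0 / 9, 6.0 / 9, 6.0 / 9};
        double[] likNo  = {3.0 / 5, 2.0 / 5, 1.0 / 5, 2.0 / 5};

        // Naive conditional-independence assumption:
        // P(X | Ci) = product of the per-attribute probabilities
        double pXGivenYes = 1, pXGivenNo = 1;
        for (double p : likYes) pXGivenYes *= p;
        for (double p : likNo)  pXGivenNo  *= p;

        // Since P(X) is constant for all classes, compare P(X|Ci) * P(Ci)
        double scoreYes = pXGivenYes * priorYes;  // ~0.028
        double scoreNo  = pXGivenNo  * priorNo;   // ~0.007
        System.out.println("P(X|yes)P(yes) = " + scoreYes);
        System.out.println("P(X|no)P(no)   = " + scoreNo);
        System.out.println("Predicted class: buys_computer = "
                + (scoreYes > scoreNo ? "yes" : "no"));
    }
}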

PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical slot. The soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty at the end of the practical in case there is no Blackboard access available.)

Roll No. E059


Class : B.TECH CS
Date of Experiment:
Grade :
Date of Grading:

Name: Shubham Gupta


Batch : E3
Date of Submission:
Time of Submission:

B.1 Software Code written by student:


(Paste your C/C++/Java code completed during the 2 hours of practical in the lab here)

package bayesian;

import java.sql.*;
import java.util.Scanner;

/**
 * Naive Bayes classifier over a table Bayesian_table(Gender, Height, Output)
 * reached through the ODBC data source "Classification".
 *
 * @author mpstme.student
 */
public class Bayesian {

    public static void main(String[] args) {
        // Height buckets: (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2], (2,inf)
        double[] low  = {0, 1.6, 1.7, 1.8, 1.9, 2};
        double[] high = {1.6, 1.7, 1.8, 1.9, 2, Integer.MAX_VALUE};
        String[] range = {"Short", "Medium", "Tall"};

        Connection con = null;
        double[] male = new double[3];                // P(Gender = M | class)
        double[] female = new double[3];              // P(Gender = F | class)
        double[][] heightProbab = new double[6][3];   // P(height bucket | class)
        double[] totalRange = new double[3];          // tuples per class
        double[] probabRange = new double[3];         // prior P(class)
        double[] pTRange = new double[3];             // P(X | class)
        double[] likelihood = new double[3];          // P(X | class) * P(class)
        double[] pRangeT = new double[3];             // posterior P(class | X)
        double totLikelihoodRange = 0;

        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            con = DriverManager.getConnection("jdbc:odbc:Classification");
            Statement st = con.createStatement();
            ResultSet rs;

            // Count tuples per (gender, class), then normalize to P(gender | class).
            for (int i = 0; i < range.length; i++) {
                rs = st.executeQuery("SELECT COUNT(*) FROM Bayesian_table"
                        + " WHERE Gender='M' AND Output='" + range[i] + "'");
                while (rs.next()) { male[i] = rs.getInt(1); }
                rs = st.executeQuery("SELECT COUNT(*) FROM Bayesian_table"
                        + " WHERE Gender='F' AND Output='" + range[i] + "'");
                while (rs.next()) { female[i] = rs.getInt(1); }
            }
            for (int i = 0; i < 3; i++) {
                double total = male[i] + female[i];
                male[i] /= total;
                female[i] /= total;
            }

            // Count tuples per (height bucket, class).
            for (int i = 0; i < range.length; i++) {
                for (int j = 0; j < 6; j++) {
                    rs = st.executeQuery("SELECT COUNT(*) FROM Bayesian_table"
                            + " WHERE Height>" + low[j] + " AND Height<=" + high[j]
                            + " AND Output='" + range[i] + "'");
                    while (rs.next()) { heightProbab[j][i] = rs.getInt(1); }
                }
            }
            // Normalize bucket counts per class to P(bucket | class);
            // remember the class sizes for the priors.
            for (int i = 0; i < 3; i++) {
                double total = 0;
                for (int j = 0; j < 6; j++) { total += heightProbab[j][i]; }
                totalRange[i] = total;
                for (int j = 0; j < 6; j++) { heightProbab[j][i] /= total; }
            }

            // Priors P(class) = class size / table size.
            rs = st.executeQuery("SELECT COUNT(*) FROM Bayesian_table");
            int count = 0;
            while (rs.next()) { count = rs.getInt(1); }
            for (int i = 0; i < 3; i++) { probabRange[i] = totalRange[i] / count; }

            // Read the unseen tuple to classify.
            Scanner sc = new Scanner(System.in);
            System.out.println("Enter Details of the person");
            System.out.println("Enter Name");
            String name = sc.nextLine();
            System.out.println("Enter Gender M/F");
            String gender = sc.nextLine();
            System.out.println("Enter Height");
            double height = sc.nextDouble();

            // Locate the bucket the entered height falls into.
            int bucket = 5;
            for (int j = 0; j < 6; j++) {
                if (height > low[j] && height <= high[j]) { bucket = j; break; }
            }

            // Likelihood per class, assuming gender and height are independent,
            // and the unnormalized posterior P(X | class) * P(class).
            for (int i = 0; i < 3; i++) {
                double pGender = gender.equals("M") ? male[i] : female[i];
                pTRange[i] = pGender * heightProbab[bucket][i];
                likelihood[i] = pTRange[i] * probabRange[i];
                totLikelihoodRange += likelihood[i];
            }
            // Normalize (Bayes' theorem) to the posterior P(class | X).
            for (int i = 0; i < 3; i++) {
                pRangeT[i] = likelihood[i] / totLikelihoodRange;
            }

            // Predict the class with the maximum posterior.
            int index = 0;
            for (int i = 1; i < 3; i++) {
                if (pRangeT[i] > pRangeT[index]) { index = i; }
            }
            System.out.println(name + " is categorised as " + range[index]);

            con.close();
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Exception: " + e.getMessage());
        }
    }
}

B.2 Input and Output:

(Paste your program input and output in the following format. If there is an error, paste the specific error in the output part. In case of an error, with due permission of the faculty, an extension can be given to submit the error-free code with output in due course of time. Students will be graded accordingly.)

run:
Enter Details of the person
Enter Name
Adam
Enter Gender M/F
M
Enter Height
1.95
Adam is categorised as Tall
BUILD SUCCESSFUL (total time: 13 seconds)

B.3 Observations and learning:


(Students are expected to comment on the output obtained with clear observations and learning for each task/sub-part assigned)

In this experiment we studied the naïve Bayes algorithm, which is used to decide which class a particular record falls into. To implement it, we created tables in SQL Server and connected Java to SQL Server using JDBC. We calculate the prior probability of each class, then the likelihood of the record under each class, and finally the posterior probability of the record belonging to each class. The record is assigned to the class with the maximum posterior probability.

B.4 Conclusion:
(Students must write the conclusions based on their learning)

Hence, we have studied the naïve Bayes classification algorithm and implemented it in Java, using JDBC to connect to SQL Server.

B.5 Questions of Curiosity


Q1. What are the issues in classification? Explain each with the help of an example.

Data Cleaning - Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute (e.g. filling a missing income field with the most frequent income bracket).
Relevance Analysis - The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related (e.g. a customer ID is irrelevant to whether the customer buys a computer).
Normalization - The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range, such as [0, 1] (see the sketch after this list).
Data Transformation - Generalize the data to higher-level concepts using concept hierarchies (e.g. generalizing numeric ages to <=30, 31...40, >40) and normalize the data, which involves scaling the values.
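
As an illustration of the normalization issue above, here is a minimal Java sketch of min-max normalization, one common way to scale an attribute into [0, 1]; the class and method names are illustrative assumptions, not part of the manual.

// Min-max normalization: scale each value of an attribute into [0, 1]
// using x' = (x - min) / (max - min); assumes max > min.
public class MinMaxNormalize {
    static double[] normalize(double[] values) {
        double min = values[0], max = values[0];
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (values[i] - min) / (max - min);
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Illustrative heights; 1.5 maps to 0.0 and 2.1 maps to 1.0
        double[] heights = {1.5, 1.7, 1.95, 2.1};
        for (double v : normalize(heights)) System.out.println(v);
    }
}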

Q2. Compare Bayesian classification with the ID3 classification technique.


Bayesian Classification:
Incrementality: The prior and likelihood can be updated dynamically. It is flexible and robust to
errors.
Combines prior knowledge and observed data

Probabilistic hypotheses: Outputs probability distribution over all classes.


Meta-classification: Outputs of several classifiers can be combined.

ID3 classification technique:


Able to generate understandable rules.

Perform classification without requiring much computation.


Able to handle both continuous and categorical values.

Provide a clear indication of which fields are most important for prediction or classification.

Q3.Summarize all approaches used for classification with their advantages and limitations.
1) Naïve Bayes Classifier
Naive Bayes classifiers are highly scalable, requiring a number of
parameters linear in the number of variables (features/predictors) in a
learning problem. Maximum-likelihood training can be done by evaluating a
closed-form expression, which takes linear time, rather than by expensive
iterative approximation as used for many other types of classifiers.
Advantages:
- Combines prior knowledge and observed data
- Probabilistic hypotheses: Outputs probability distribution over all classes.
Limitations:
- The independence assumption may not hold for some attributes.
- If there are no occurrences of a class label together with a certain attribute value (e.g. class = "nice", shape = "sphere"), then the frequency-based probability estimate will be zero; a common remedy is Laplace smoothing, sketched after this list.
- If the naïve Bayes classification algorithm is trained on only a small data set, the probability estimates are unreliable, and precision and recall will be very low.
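
As a remedy for the zero-frequency limitation above, here is a minimal Java sketch of Laplace (add-one) smoothing; the class and method names and the example counts are illustrative assumptions, not part of the manual's code.

// Laplace (add-one) smoothing: avoid zero estimates for unseen
// (attribute value, class) pairs by adding 1 to every count:
// P(value | class) = (count + 1) / (classTotal + numValues)
public class LaplaceSmoothing {
    static double smoothed(int count, int classTotal, int numValues) {
        return (count + 1.0) / (classTotal + numValues);
    }

    public static void main(String[] args) {
        // shape = "sphere" never seen with class = "nice": raw estimate is 0/10
        System.out.println("raw      = " + (0 / 10.0));
        // with 3 possible shape values, smoothed estimate is 1/13, never zero
        System.out.println("smoothed = " + smoothed(0, 10, 3));
    }
}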
2) K-nearest neighbor
k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until classification.
The k-NN algorithm is among the simplest of all machine learning algorithms.
Advantages:
Simplicity, effectiveness, intuitiveness and competitive classification performance in many domains are the advantages. It is robust to noisy training data and is effective if the training data is large.
Limitations:
- With distance-based learning it is not clear which type of distance metric to use and which attributes to use to produce the best results.
- Computation cost is quite high because we need to compute the distance from each query instance to all training samples, as the sketch after this list illustrates.
- k-NN can have poor run-time performance when the training set is large.
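
To illustrate the computation cost noted above, here is a minimal Java sketch of the Euclidean distance step that k-NN repeats for every training sample on every query; the data values and names are illustrative assumptions.

import java.util.Arrays;

// For every query, k-NN computes one distance per training sample,
// so classification cost grows linearly with the training set size.
public class KnnDistance {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Illustrative training samples: (height in m, weight in kg)
        double[][] training = {{1.5, 50}, {1.7, 65}, {1.95, 90}};
        double[] query = {1.8, 70};
        // Distance to every training sample: the expensive step.
        double[] dists = new double[training.length];
        for (int i = 0; i < training.length; i++) {
            dists[i] = euclidean(query, training[i]);
        }
        System.out.println(Arrays.toString(dists));
    }
}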

3) Decision Tree Learning


Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

Advantages:
- Able to generate understandable rules.
- Performs classification without requiring much computation.
- Able to handle both continuous and categorical values.
- Provides a clear indication of which fields are most important for prediction or classification.
Limitations:
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node; such algorithms cannot guarantee to return the globally optimal decision tree (the sketch after this list shows the entropy measure that drives these greedy splits).
- Decision-tree learners can create over-complex trees that do not generalise well from the training data.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large.
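
To illustrate the greedy heuristic mentioned above, here is a minimal Java sketch of the entropy and information-gain computation that ID3 (compared in Q2) uses to pick each split; it uses the buys_computer class counts from the training data in Part A, and the class and method names are illustrative.

// ID3 picks the attribute whose split maximizes information gain,
// i.e. the one that most reduces the entropy of the class distribution.
public class Id3Entropy {
    // Entropy H = -sum(p * log2 p) over the class counts.
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // buys_computer training data: 9 "yes", 5 "no"
        double before = entropy(9, 5);                      // ~0.940
        // Split on age: <=30 -> (2 yes, 3 no), 31...40 -> (4, 0), >40 -> (3, 2)
        double after = 5.0 / 14 * entropy(2, 3)
                     + 4.0 / 14 * entropy(4, 0)
                     + 5.0 / 14 * entropy(3, 2);
        System.out.println("information gain(age) = " + (before - after)); // ~0.247
    }
}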
