
Audio Classification Using Supervised Machine Learning Algorithms
Master of Engineering

In

Electronics and Communication Engineering

(Digital Systems)

By

Farhana Anjum

Roll No. 160318744003

Department of Electronics and Communication Engineering

Deccan College of Engineering and Technology

(Affiliated to Osmania University)

Hyderabad-500007

2018
CONTENTS

Abstract

Chapter 1 Introduction
1.1 Introduction
1.2 Aim of the Report
1.3 Motivation of the Report
1.4 Research Methodology
1.5 Applications of the Report
1.6 Organization of the Report
Chapter 2 Literature Survey
Chapter 3 Classification and Machine Learning
3.1 Introduction
3.2 Audio Classification
3.3 Support Vector Machine
3.3.1 Advantages
3.3.2 Disadvantages
3.4 K-Nearest Neighbors
3.4.1 Advantages
3.4.2 Disadvantages
3.5 Naïve Bayes Classification
3.5.1 Advantages
3.5.2 Disadvantages
3.6 Summary
Chapter 4 Classification using Supervised Learning Algorithms
4.1 Introduction
4.2 Classification by Support Vector Machine
4.3 Classification by K Nearest Neighbor
4.4 Classification by Naïve Bayes Algorithm
4.5 Performance Parameters
4.6 Summary
Chapter 5 Results and Discussion
5.1 Introduction
5.2 Explore and Analyze the Signal
5.3 Feature Extraction
5.4 Classification
5.5 Summary
Chapter 6 Conclusions
6.1 Future Scope
References
Abstract

Audio classification has great theoretical and practical value in both pattern recognition and artificial intelligence. Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn from data without being explicitly programmed. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data; rather than following strictly static program instructions, such algorithms make data-driven predictions or decisions by building a model from sample inputs. Supervised learning and unsupervised learning are the two types of machine learning algorithms. Classification is considered an instance of supervised learning, where a training set of correctly identified observations is available. Classification is a technique, performed on structured or unstructured data, in which we categorize data into a given number of classes. In this report, we propose an audio classification method based on supervised machine learning algorithms. The features used for audio classification are the mean, median, standard deviation, dominant frequency, spectrum entropy, and Mel Frequency Cepstral Coefficients (MFCCs). The report introduces various supervised machine learning algorithms, namely the Support Vector Machine (SVM), K nearest neighbor (KNN), and Naïve Bayes (NB) classifier. The Support Vector Machine is a kernel-based model, K nearest neighbor is based on neighborhood analysis, and the Naïve Bayes classifier is a probability-based model. In this report, the Support Vector Machine is implemented on audio data consisting of heart sound recordings for binary classification into abnormal and normal classes. The observed training time for SVM is 147.32 secs, the prediction speed is 5900 obs/sec, and the classification accuracy is 87.1%.
Chapter 1
Introduction
1.1 Introduction

The development of multimedia technology and the internet has brought huge amounts of multimedia information to people, and has further resulted in the production of large-scale multimedia information databases [1][2]. As the information on the internet increases day by day, people urgently need effective and efficient tools to automatically classify information, retrieve essential information from unstructured data, and assign it to a predefined category. Classification is a technique, performed on structured or unstructured data, in which we categorize data into a given number of classes. The main goal of a classification problem is to identify the class into which a new data sample falls. This can be done using supervised and unsupervised machine learning algorithms. In supervised learning the data is labelled and the classifier learns to predict the output. In unsupervised learning the data is not labelled and the classifier learns the inherent structure of the data.

In this report, the classification of an audio dataset is described. Audio is one of the most important sources of human perception. Audio data has the characteristics of complex structure, massive volume, and high data-processing requirements [8]. To collect the necessary information from the data, significant features must be extracted. These features are used to train the classifier. The class of a new test instance is predicted by the classifier according to the knowledge it gained during the training phase. In this report, the support vector machine (SVM), a supervised machine learning algorithm, is used to classify audio data, and validation is done to measure the performance accuracy. We use different predictive models to enhance performance accuracy and compare them. To improve model performance, aside from trying other algorithms, we optimize the model by changing its parameters.

1.2 Aim of the Report


The aim of this report is to implement various supervised machine learning algorithms for audio classification. To achieve this aim, the following objectives are fulfilled:

1. A detailed study of predictive models for classification, feature extraction methods, and speech signal processing.
2. Classification of audio using the Support Vector Machine (SVM) algorithm.
3. Classification of audio using the K Nearest Neighbors (KNN) algorithm.
4. Classification of audio using the Naïve Bayes algorithm.
5. A proposed technique to improve model performance by optimizing parameters.
6. Performance evaluation and comparative analysis.
1.3 Motivation of the Report

This report focuses on selecting the best method for classifying audio data into predefined categories. It emphasizes extracting features from audio files and turning a trained model into a predictive tool for new data. Because audio data has a complex structure, it is difficult to process and analyze audio information, and it is of great importance to extract structured information. To solve this, audio classification must be handled well. In this project, the most important features of the audio are extracted and used to train the model. Audio classification can be efficiently achieved using supervised machine learning algorithms such as the support vector machine (SVM), K nearest neighbors (KNN), and the probability-based Naïve Bayes algorithm. To enhance performance accuracy, the model is optimized by changing its parameters.

1.4 Research Methodology

The code is developed using MATLAB. MATLAB offers various tools, of which the Statistics and Machine Learning Toolbox was used for classification. The dataset is collected from the 2016 PhysioNet/Computing in Cardiology Challenge and consists of thousands of recorded heart sounds ranging in length from 5 to 120 seconds. The dataset includes 3240 recordings for model training and 301 recordings for model validation. We explore the data using the Signal Analyzer app in the Signal Processing Toolbox. To speed up the feature extraction process, we distribute the computation across available cores using the parfor construct in the Parallel Computing Toolbox. To study and compare various predictive models we use the Classification Learner app provided by MATLAB.
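A minimal sketch of this parallel feature-extraction step is given below; the folder name, file pattern, and the extractFeatures helper are assumptions used for illustration, not part of the original code.

% Sketch: distribute feature extraction across available cores.
% The 'training' folder, the *.wav pattern, and extractFeatures
% are hypothetical.
files = dir(fullfile('training', '*.wav'));
features = cell(numel(files), 1);
parfor k = 1:numel(files)                  % Parallel Computing Toolbox
    [x, fs] = audioread(fullfile('training', files(k).name));
    features{k} = extractFeatures(x, fs);  % hypothetical helper
end
featureMatrix = vertcat(features{:});      % one row of features per file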

1.5 Applications of the Report

The purpose of this report is the application of machine learning algorithms for enhancing the performance accuracy of audio classification. The applications of machine learning algorithms for classification are:

• Classification is an active research area in information retrieval and machine learning.
• It is widely used in statistical analysis and data mining.
• It helps organizations maintain the confidentiality and integrity of their data.
• It helps scientists identify and study different organisms.
• It is used in biometric identification and medical imaging.
1.6 Organization of the Report

This report consists of six chapters, including the introduction and conclusion.

Chapter 1: Gives the introduction, aim of the report, motivation of the report, research methodology, and applications of the report.

Chapter 2: Gives the literature survey.

Chapter 3: Gives the types of classification using machine learning algorithms; the Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes classifiers; and their advantages and disadvantages.

Chapter 4: Gives the mathematical analysis of the Support Vector Machine, K nearest neighbor, and Naïve Bayes classifiers.

Chapter 5: Gives the results of analyzing the audio signals, their power spectra, feature extraction, the scatter plot of the original dataset, and the confusion matrix.

Chapter 6: Gives the conclusion of the report and the future scope.

Chapter 2

Literature Survey
E. Loper, E. Klein, and S. Bird, “Preprocessing Raw Text” in Natural Language Processing with Python. This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.

D. Greene and P. Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”. This paper investigates the implications of diagonal dominance for unsupervised kernel methods, specifically in the task of document clustering.

F. Sebastiani, “Machine Learning in Automated Text Categorization”. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. It discusses in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

J. Grimmer and B.M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”, Political Analysis. This survey presents a wide range of new methods, provides guidance on how to validate the output of the models, and clarifies misconceptions and errors in the literature. The authors conclude that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

D.D. Lewis and M. Ringuette, “A Comparison of Two Learning Algorithms for Text Categorization”. This paper examines the use of inductive learning to categorize natural language documents into predefined content categories, and presents empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization datasets.

G.D. Guo, H. Wang, D. Bell, Y.X. Bi, and K. Greer, “Using KNN model for automatic text categorization”, Soft Computing. An investigation is conducted of two well-known similarity-based learning approaches to text categorization: the k-nearest neighbor (k-NN) classifier and the Rocchio classifier. After identifying the weaknesses and strengths of each technique, a new classifier called the kNN model-based classifier (kNNModel) is proposed.

C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning. This paper demonstrates the high generalization ability of support-vector networks utilizing polynomial input transformations. It also compares the performance of the support-vector network with various classical learning algorithms that took part in a benchmark study of Optical Character Recognition.

A.S. Patil and B.V. Pawar, “Automated Classification of Websites using Naïve Bayesian Algorithm”. In this paper, web sites are classified based on the content of their home pages using the Naïve Bayesian machine learning algorithm. Traditionally, two major methods are used for Text Categorization (TC): the rule-based approach and the machine learning approach. In the rule-based method, classification rules are generated by experts; this method is accurate but not cost effective. In the machine learning approach, on the other hand, the grouping methods are created automatically with the help of statistical algorithms; this approach is cost effective, and a classifier for a new domain is easy to construct. Hence, automated techniques that classify texts into predefined categories based on their content are the most popular. Many training algorithms for TC have been developed in the past few years: the probability-based Naïve Bayesian method, decision tree learning algorithms, K-Nearest Neighbors, Support Vector Machines, etc. Among these automated classification approaches, NB is one of the most commonly used algorithms. It calculates the probability of the contents belonging to a class.

A. S. Patil & B. Pawar used Simple NB for automatic text categorization and got approximately
80% accuracy. Deep feature weighting (DFW) for Naïve Bayes was introduced where DFW
estimates the conditional probabilities of NB by deeply computing feature weighted frequencies
from training data. Furthermore, various smoothing methods are mentioned for increasing NB
learning and applied it for the classification for short text although several approaches have been
proposed, they are not faultless and still needs improvements.

Chapter 3

Classification and Machine learning

3.1 Introduction

Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give systems the ability to learn from data without being explicitly programmed. Machine learning algorithms are divided into unsupervised and supervised learning. In machine learning and statistics, classification is the problem of identifying to which set of categories a new observation belongs, based on a training set of data containing observations whose categories are known. Classification is considered an instance of supervised learning. Fig 3.1 shows the types of machine learning algorithms.

[Figure: machine learning divides into unsupervised learning, where classifiers are left to themselves to discover structure in the data, and supervised learning, where the classifier learns from previously given examples.]

Fig 3.1 Classification of machine learning algorithms

The terminologies encountered in machine learning classification are:

• Classifier: An algorithm that maps the input data to a specific category.
• Classification model: A model that tries to draw conclusions from the input values given for training, and predicts the class labels for new data.
• Feature: An individual measurable property of a phenomenon being observed.
• Binary classification: A classification task with two possible outcomes.
• Multi-class classification: A classification task with more than two classes.

3.2 Audio Classification

Audio classification is a technique in which we classify audio data into a given number of classes. It belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing.

The following are the steps involved in building a classification model:

1. Access and explore the data.
2. Preprocess the data and extract features.
3. Develop the predictive model.

The workflow of the classification task is shown in Fig 3.2.

[Figure: training datasets are accessed and pre-processed, features are extracted, and a model is trained and its parameters optimized to produce a predictive model; features extracted from testing datasets are run through the model to produce predictions.]

Fig 3.2 Work flow of classification
• Access and explore the data: The dataset from the PhysioNet library includes 3240 heart sound recordings for training the model and 301 recordings for testing the model. A standard procedure in machine learning is to split the data into training and testing sets and store them in separate folders. Common ways to explore data include inspecting some examples, creating visualizations, and applying signal processing techniques to identify patterns.

• Preprocess the data and extract features: Preprocessing the data includes removing outliers and trends, imputing missing data, and normalizing the data. These tasks are not required for our dataset, as it has already been preprocessed by the organizers. Feature extraction is one of the most important tasks in machine learning because it turns raw data into information that is suitable for machine learning algorithms.

• Develop the predictive model: Developing the predictive model is an iterative process. No single machine learning algorithm works for every problem, and identifying the right algorithm is often a process of trial and error. We can select an individual classifier, or select multiple classifiers, train them in parallel, and compare their performance. The different kinds of algorithms are shown in Fig 3.3.

[Figure: classification algorithms divide into unsupervised algorithms (clustering) and supervised algorithms (the SVM, KNN, and Naïve Bayes algorithms).]

Fig 3.3 Types of machine learning algorithms for classification
3.3 Support Vector Machine

Support Vector Machine (SVM) [3] is a supervised learning algorithm for classification and regression. Given a set of n-dimensional vectors in a vector space, SVM finds the separating hyper-plane that splits the vector space into subsets of vectors; each separated subset is assigned one class. The constraint on this separating hyper-plane is that it must maximize the margin between the two subsets. Suppose we have some n-dimensional vectors, each belonging to one of two classes. We can find many (n-1)-dimensional hyper-planes that classify such vectors, but there is only one hyper-plane that maximizes the margin between the two classes; in other words, the distance from the hyper-plane to the nearest point on either side is maximized. Such a hyper-plane is called the maximum-margin hyper-plane, and the resulting classifier is a maximum-margin classifier. According to the classification rule of the SVM, a new vector is classified into one of the two classes.

3.3.1 Advantages

• Learning results are more robust.
• It has a regularization parameter, which reduces the over-fitting problem.
• It works well with fewer training samples.

3.3.2 Disadvantages

• If the points on the boundaries are not informative due to noise, SVMs will not do well.
• It can be computationally expensive.
• The problem must be reformulated for multi-class tasks, since the basic SVM supports only two-class classification.

3.4 K-Nearest Neighbors

The k-nearest neighbors algorithm (k-NN) [4] is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object: the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

3.4.1 Advantages

• Robust to noisy training data.
• Effective if the training data is large.

3.4.2 Disadvantages

• The value of k must be determined.
• It is not clear which distance metric and which attributes produce the best results.
• The computation cost is quite high, because the distance from each query instance to all training samples must be computed.

3.5 Naïve Bayes Classification

Naïve Bayes (NB) [6] is a probability-based supervised learning algorithm, famous for its conceptual and computational simplicity. NB is a multinomial supervised learning method based on probabilistic measurement applying Bayes' theorem. For a supervised learning setting, probabilistic NB classifiers can be trained very efficiently. In NB, two types of probabilities are calculated during training: the prior probability and the conditional probability. The probability values obtained during training are then used in the testing process, where we predict the probability of a new instance belonging to a particular class. The probability of an individual class is the ratio of the number of samples of that class to the total number of samples; this is called the prior probability. The likelihood function of an individual given class is also calculated; this is called the conditional probability. Then the posterior probability is calculated, and the class with the maximum posterior probability is taken as the class of the new test sample.
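As a hypothetical numeric illustration (the counts are assumptions, not taken from the dataset): if 60 of 100 training samples belong to the abnormal class, the prior is P(abnormal) = 0.6; if a given feature value occurs in 30 of those 60 abnormal samples, the conditional probability is P(feature | abnormal) = 0.5; the posterior for a new sample with that feature is then proportional to 0.6 × 0.5 = 0.3, which is compared against the corresponding product for the normal class.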

3.5.1 Advantages

• Very simple, easy to implement, and fast.
• If the NB conditional independence assumption holds, it will converge more quickly than discriminative models like logistic regression.
• Even if the NB assumption does not hold, it works well in practice.

3.5.2 Disadvantages

• Assumes independence of features.
• Performance accuracy decreases as the training data grows large.

3.6 Summary
In this chapter, a brief overview of supervised and unsupervised machine learning is given, and the process of classification is briefly described. Different methods of classification using supervised machine learning algorithms are described: the Support Vector Machine algorithm, the K-nearest neighbors method, and Naïve Bayes classification, along with their advantages and limitations.

Chapter 4

Classification using Supervised learning algorithms

4.1 Introduction

Classification refers to categorizing a new data sample into a particular class. It is the problem of identifying to which of a set of categories a new observation belongs, based on a training set of data containing observations or instances whose category membership is known. Classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised learning technique is known as clustering. Supervised learning is the machine learning task of learning the function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input vector and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to new vectors. A supervised machine learning algorithm is a procedure executed iteratively, comparing various results until a satisfactory result is obtained. The purpose of classification is to design a predictive model that gives accurate and best results.

4.2 Classification by Support Vector Machine algorithm

Support Vector Machine (SVM) [9] is a supervised machine learning algorithm used for binary classification. It is a kernel-based algorithm. It can be employed for both classification and regression, but SVMs are more commonly used in classification problems. Support vectors are the data points nearest to the hyperplane: the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set. The SVM finds the separating hyper-plane that splits the vector space into subsets of vectors, and each separated subset is assigned one class. A hyperplane is a line that linearly separates and classifies a set of data, so when new testing data is added, the side of the hyperplane on which it lands decides the class that we assign to it.

Let $\{X_1, X_2, \ldots, X_n\}$ be the training set of vectors and let $y_i \in \{1, -1\}$ be the class label of vector $X_i$. It is necessary to determine the maximum-margin hyper-plane that separates the vectors belonging to $y_i = 1$ from the vectors belonging to $y_i = -1$. This hyper-plane is written as the set of points satisfying:

$$W^T \otimes X_i + b = 0 \qquad (4.1)$$

where $\otimes$ denotes the scalar product, $W$ is a weight vector perpendicular to the hyper-plane (also called the normal vector, used to specify the hyper-plane), and $b$ is the bias. The value $b / \lVert W \rVert$ is the offset of the hyper-plane from the origin along the weight vector $W$.

To calculate the margin, two parallel hyper-planes are constructed, one on each side of the maximum-margin hyper-plane. These two parallel hyper-planes are represented by the following equations:

$$W^T \otimes X_i + b = 1 \qquad (4.2)$$

$$W^T \otimes X_i + b = -1 \qquad (4.3)$$

To prevent vectors from falling into the margin, all vectors belonging to the two classes $y_i = 1$ and $y_i = -1$ obey the following constraints respectively:

$$W^T \otimes X_i + b \ge 1 \quad \text{for } y_i = 1 \qquad (4.4)$$

$$W^T \otimes X_i + b \le -1 \quad \text{for } y_i = -1 \qquad (4.5)$$

These constraints can be rewritten as:

$$y_i \left( W^T \otimes X_i + b \right) \ge 1 \qquad (4.6)$$

For any new vector $X_i$, the classification rule is computed as:

$$f(X_i) = \operatorname{sign}\left( W^T \otimes X_i + b \right) \in \{-1, +1\} \qquad (4.7)$$

The flowchart of Support Vector Machine (SVM) algorithm is shown in Fig 4.1.

[Flowchart: the audio dataset is split into training and testing datasets; feature vectors are extracted and scaled; the separating hyper-plane and the maximum margin are found; the SVM model is trained and cross-validated; the results are reported in a confusion matrix.]

Fig 4.1 Flowchart of Support Vector Machine

The following are the steps of classification using the SVM algorithm:

Step 1: Split the dataset into training and testing datasets.

Step 2: Extract feature vectors from the training and testing sets.

Step 3: Scale the extracted feature vectors.

Step 4: Determine the hyper-plane that separates the vector space into subsets.

Step 5: Calculate the maximum margin by constructing parallel hyper-planes.

Step 6: Apply cross validation.

Step 7: Train the SVM classifier model.

Step 8: Classify each new vector into one of the two classes according to the training phase, and show the predicted output through a confusion matrix.
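As an illustration of Steps 6-8, the sketch below trains and cross-validates a binary SVM in MATLAB; the featureTable variable, its Class column, and the kernel choice are assumptions for illustration, not the report's actual code.

% Sketch: train, cross-validate, and evaluate an SVM classifier
% (Statistics and Machine Learning Toolbox). featureTable is a
% hypothetical table of scaled feature vectors with a Class column.
svmModel = fitcsvm(featureTable, 'Class', ...
    'KernelFunction', 'rbf', 'Standardize', true);
cvModel  = crossval(svmModel, 'KFold', 5);    % 5-fold cross validation
accuracy = 1 - kfoldLoss(cvModel);            % cross-validated accuracy
predicted = predict(svmModel, featureTable);  % predicted class labels
confusionchart(featureTable.Class, predicted);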

4.3 Classification by K Nearest Neighbor

The K-nearest neighbor (KNN) [11] algorithm is among the simplest of all machine learning algorithms. In this algorithm, an object is classified by a majority vote of its neighbors: the object is assigned to the class that is most common among its K nearest neighbors, where K is a positive integer that is typically small. If K = 1, the object is simply assigned to the class of its single nearest neighbor.

The KNN algorithm is first formulated by introducing some notation. $S = \{(x_i, y_i)\}, \; i = 1, 2, \ldots, N$ is the training set, where $x_i$ is a d-dimensional feature vector and $y_i \in \{1, -1\}$ is its observed class label. We generally suppose that all training data are samples of random variables with unknown distribution.

With the previously labeled samples as the training set $S$, the KNN algorithm constructs a local subregion $R(x)$ of the input space, situated at the estimation point $x$. The predicting region $R(x)$ contains the $k$ closest training points to $x$ and is written as:

$$R(x) = \{\hat{x} \mid D(x, \hat{x}) \le d_{(k)}\} \qquad (4.8)$$

where $d_{(k)}$ is the $k$th order statistic of $\{D(x, \hat{x})\}_1^N$ and $D(x, \hat{x})$ is the distance metric. $k[y]$ denotes the number of samples in the region $R(x)$ that are labeled $y$. The KNN algorithm is statistically designed for the estimation of the posterior probability $P(y \mid x)$ of the observation point $x$:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} \cong \frac{k[y]}{k} \qquad (4.9)$$

For a given observation $x$, the decision $g(x)$ is formulated by evaluating the values of $k[y]$ and selecting the class that has the highest $k[y]$ value:

$$g(x) = \begin{cases} 1, & k[y = 1] \ge k[y = -1] \\ -1, & k[y = -1] > k[y = 1] \end{cases} \qquad (4.10)$$

Thus, the decision that maximizes the associated posterior probability is employed in the KNN algorithm. For a binary classification problem in which $y_i \in \{1, -1\}$, the KNN algorithm produces the following decision rule:

$$g(x) = \operatorname{sgn}\left( \operatorname{ave}_{x_i \in R(x)} y_i \right) \qquad (4.11)$$

The flowchart of K nearest neighbor (KNN) algorithm is shown in Fig 4.2

[Flowchart: the data is split into training and testing sets; the value of K, the distance type D, and the testing dataset are read; the data is cross-validated; the majority class label among the K neighbors is assigned to the test data; the predicted output is produced.]

Fig 4.2 Flowchart of K nearest neighbor algorithm

The following are the steps of classification using the KNN algorithm:

Step 1: Split the dataset into training and testing datasets.

Step 2: Choose the value of K and the type of distance to be used.

Step 3: For a new test instance, find its K nearest neighbors.

Step 4: Assign the majority class label among the K neighbors to the test data.

Step 5: Show the predicted output, compare it to the true value, and find the accuracy.
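A corresponding MATLAB sketch for these steps is given below; featureTable is the same hypothetical feature table as before, and the choices K = 5 and Euclidean distance are illustrative assumptions.

% Sketch: K nearest neighbor classification with an explicit K and
% distance metric, reusing the hypothetical featureTable.
knnModel = fitcknn(featureTable, 'Class', ...
    'NumNeighbors', 5, 'Distance', 'euclidean');
cvKnn = crossval(knnModel, 'KFold', 5);   % cross validate the data
knnAccuracy = 1 - kfoldLoss(cvKnn);       % cross-validated accuracy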

4.4 Classification by Naïve Bayes classifier

Naïve Bayes (NB) is a multinomial supervised learning method based on probabilistic measurement applying Bayes' theorem. For a supervised learning setting, probabilistic NB classifiers can be trained very efficiently. In NB, two types of probabilities are calculated during training: the prior probability and the conditional probability. The probability values obtained during training are then used in the testing process, where we predict the probability of a new instance belonging to a particular class.

The probability of an individual class is

$$P(C_n) = \frac{\text{number of instances of the class}}{\text{total number of instances}}$$

Here, $P(C_n)$ is the prior probability. The likelihood function of an individual given class, also called the conditional probability, is calculated as:

$$P(W_n \mid C_n) = \frac{\operatorname{count}(W_n, C_n) + 1}{\operatorname{count}(C_n) + |V_n|} \qquad (4.12)$$

Here, $\operatorname{count}(W_n, C_n)$ is the count of each individual word within the class $C_n$, $\operatorname{count}(C_n)$ is the count of all words in the class, and $V_n$ is the vocabulary. The frequencies of the isolated informative words of each particular class are $f_1, f_2, \ldots, f_n$. The posterior probability for each class is then:

$$P(C_n \mid W_n) = \operatorname*{argmax}_{C_n \in C} \; P(C_n) \cdot P(W_n \mid C_n)^{f_n} \qquad (4.13)$$

For a new sample, the class is found as the maximum of the posterior probability, given by the following equation:

$$P(x) = \operatorname{argmax}\left[ P(C_n \mid W_n) \right] \qquad (4.14)$$

The flowchart of Naïve Bayes classifier is as shown in Fig 4.3

[Flowchart: the dataset is split into training and testing datasets; a feature subset is chosen from the training data and cross-validated; if the cross-validation error is not yet the smallest, another subset is chosen; otherwise the selected features are used to model the testing data, and the results are reported in a confusion matrix.]

Fig 4.3 Flowchart of Naïve Bayes classifier

The following are the steps of classification using the Naïve Bayes algorithm:

Step 1: Split the dataset into training and testing sets.

Step 2: Extract features from the training set and take a subset of the features.

Step 3: Subject the extracted features to cross validation.

Step 4: If the cross-validation error is not yet the smallest, repeat from Step 2; otherwise, select the current subset of features.

Step 5: Give the selected features and the features extracted from the testing dataset to the model for validation.

Step 6: Display the predicted output through a confusion matrix.
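The corresponding MATLAB sketch is below; featureTable is the same hypothetical feature table, and fitcnb's default distribution settings stand in for the feature-subset search shown in the flowchart.

% Sketch: Naive Bayes classification on the hypothetical featureTable.
nbModel = fitcnb(featureTable, 'Class');
cvNb = crossval(nbModel, 'KFold', 5);     % cross validation
nbAccuracy = 1 - kfoldLoss(cvNb);         % cross-validated accuracy
nbPredicted = predict(nbModel, featureTable);
confusionchart(featureTable.Class, nbPredicted);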

4.5 Performance Parameters

• Mean: The average, used to derive the central tendency of the data.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad (4.15)$$

where $n$ is the total number of samples and $x_1, x_2, \ldots, x_n$ are the samples.

• Median: The value separating the higher half from the lower half of the data samples; for sorted data, it is the value at position

$$\frac{n + 1}{2} \qquad (4.16)$$

where $n$ is the number of samples.

• Standard deviation: A measure used to quantify the amount of variation of a set of data values.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} \qquad (4.17)$$

where $x_1, x_2, \ldots, x_N$ are the samples, $\bar{x}$ is the mean value, and $N$ is the number of observations.

• Dominant frequency: The frequency of the sinusoidal component with the highest amplitude.

• Spectrum entropy: A measure of the spectral power distribution of a signal, based on Shannon entropy.

$$SE = -\sum_{i=1}^{n} p_i \ln p_i \qquad (4.18)$$

where $p_i$ is the probability density function and $n$ is the number of samples.

• Mel Frequency Cepstral Coefficients (MFCCs): Coefficients derived from a type of cepstral representation of the audio clip.

$$C_m = \sum_{k=1}^{K} (\log D_k) \cos\left[ m \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right]$$

where $m = 0, 1, 2, \ldots, K-1$ is the index of the coefficient, $K$ is the number of coefficients, and $D_k$ is the output of the $k$th filter bank.
4.6 Summary

In this chapter, a brief overview of various supervised machine learning algorithms for classification is given. The supervised algorithms, namely the Support Vector Machine, the K-nearest neighbor algorithm, and the Naïve Bayes algorithm, are discussed in detail along with their implementation steps and flowcharts. The different performance parameters used for audio data classification are also described.

Chapter 5

Results and Discussion

5.1 Introduction

The Support Vector Machine algorithm is implemented in MATLAB. We use the dataset from the 2016 PhysioNet/Computing in Cardiology Challenge, which consists of thousands of recorded heart sounds ranging in length from 5 to 120 seconds. The dataset consists of 3240 samples. Feature extraction is done, and features such as the dominant frequency, spectrum entropy, and MFCCs are extracted. We train the Support Vector Machine (SVM) classifier applying 5-fold cross validation and develop a predictive model for audio classification. A confusion matrix is used to describe the performance of the SVM classifier. The main aim is to classify the audio dataset as accurately as possible.

The analysis and results in the report are summarized as follows:

1. Exploration of the heart sound data and analysis of the signals.
2. Features extracted from the heart sound data.
3. Scatter plot of the original audio data.
4. Confusion matrix of the SVM classifier.

5.2 Explore and Analyze the signal

The audio dataset includes 3240 recordings for training and 301 recordings for validation. The training and testing data are kept in two separate folders. Common ways to explore data include creating visualizations and applying signal processing techniques. One normal heart sound wave and one abnormal heart sound wave are plotted. We notice that the abnormal heart sound has higher frequencies, with noise between beats; the normal heart sound is more regular, with silence between beats. The signal of a normal heart beat is shown in Fig 5.1 and the signal of an abnormal heart beat is shown in Fig 5.2.

Fig 5.1 Normal heart wave

Fig 5.2 Abnormal heart wave

The normal and abnormal heart signals are analyzed, and their power spectra are plotted. The time plot of the normal heart wave is shown in Fig 5.3 and that of the abnormal heart wave in Fig 5.5. The power spectrum of the normal heart signal is shown in Fig 5.4 and that of the abnormal heart signal in Fig 5.6.

[Plot: PCG_normal amplitude versus sample index.]

Fig 5.3 Time plot of all normal heart signals


The frequencies of the normal heart wave are normalized, and the power spectrum is plotted against the normalized frequency. The power spectrum of the normal heart signal is shown in Fig 5.4.

[Plot: power spectrum (dB) versus normalized frequency.]

Fig 5.4 Power spectrum of normal heart signals

[Plot: PCG_abnormal amplitude versus samples.]

Fig 5.5 Time plot of abnormal heart signal


Similarly, the frequencies of the abnormal heart wave are normalized, and its power spectrum is plotted against the normalized frequency. The power spectrum of the abnormal heart signal is shown in Fig 5.6.

[Plot: power spectrum (dB) versus normalized frequency.]

Fig 5.6 Power spectrum of abnormal heart signal

5.3 Feature Extraction

Feature extraction is one of the most important parts of machine learning because it turns raw data into information that is suitable for machine learning algorithms. Feature extraction eliminates the redundancy present in many types of measured data, facilitating generalization during the learning phase. Generalization is critical to avoiding overfitting the model to specific examples.

From the heart sounds dataset, we extract the following types of features:

• Summary statistics: mean, median, and standard deviation.
• Frequency domain: dominant frequency, spectrum entropy, and Mel Frequency Cepstral Coefficients (MFCCs).

Extracting the types of features listed above yields 18 features from each audio signal. A few extracted features are listed in Table 5.1.

Table 5.1 Extracted features of a few samples of the audio dataset

Sample   Mean        Median      Standard    Spectral    Dominant    MFCC
                                 deviation   entropy     frequency
HB1      -2.712e-5   0.000152    0.02033     0.28682     17.098      88.196
HB2      -4.330e-6   6.1035e-5   0.021358    0.29779     15.633      88.048
HB3      1.645e-5    0.000366    0.021588    0.23184     26.38       90.012
HB4      -7.897e-5   -0.000152   0.019643    0.25299     24.915      87.303
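A sketch of how these per-recording features could be computed in MATLAB follows; the file name is hypothetical, mfcc assumes the Audio Toolbox, and pentropy assumes the Signal Processing Toolbox.

% Sketch: compute the summary-statistic and frequency-domain features
% for one recording. 'heartbeat.wav' is a hypothetical file name.
[x, fs] = audioread('heartbeat.wav');
mu  = mean(x);                             % mean
med = median(x);                           % median
sd  = std(x);                              % standard deviation
[pxx, f] = periodogram(x, [], [], fs);     % power spectral density
[~, idx] = max(pxx);
domFreq = f(idx);                          % dominant frequency
specEnt = pentropy(x, fs, 'Instantaneous', false);  % spectral entropy
coeffs  = mfcc(x, fs);                     % MFCCs, one row per frame
mfccFeatures = mean(coeffs, 1);            % summarize over frames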

5.4 Classification

Before training actual classifiers, we need to divide the data into a training and a validation set. The validation set is used to measure accuracy during model development. For large datasets such as the heart sounds data, holding out a certain percentage of the data is appropriate; cross-validation is recommended for smaller datasets because it maximizes how much data is used for model training and typically results in a model that generalizes better. A sketch of such a holdout split is given below, followed by the scatter plot of the original dataset in Fig 5.7.
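A minimal sketch of the holdout split, assuming the hypothetical featureTable and an illustrative 25% holdout fraction:

% Sketch: stratified holdout split of the feature table.
c = cvpartition(featureTable.Class, 'HoldOut', 0.25);
trainTbl = featureTable(training(c), :);   % data for model training
valTbl   = featureTable(test(c), :);       % data for validation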

[Plot: median value versus mean value of the feature vectors.]

Fig 5.7 Scatter plot of original data


In the scatter plot above, the red marks are the feature vectors of abnormal heart signals and the blue marks are the feature vectors of normal heart signals. The predicted output class of the samples, compared with the actual (true) class, is given in the confusion matrix for the Support Vector Machine (SVM) classifier. The confusion matrix for the SVM classifier is shown in Fig 5.8.

Fig 5.8 Confusion matrix for SVM classifier

From the confusion matrix it is noticed that 2160 heart signal samples are correctly classified as abnormal and 998 are misclassified into the normal class. Similarly, 9174 samples are correctly classified as normal and 683 samples are misclassified into the abnormal class. The accuracy for the SVM classifier is therefore (2160 + 9174) / (2160 + 998 + 9174 + 683) ≈ 87.1%. The training time is 147.32 secs and the prediction speed is 5900 obs/sec. Instead of using scatter plots and confusion matrices, we could use a receiver operating characteristic (ROC) curve, a useful tool for visually exploring the trade-off between true positives and false positives. The ROC curve for the SVM model is shown in Fig 5.9.

Fig 5.9 ROC curve of SVM model
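A sketch of producing such a curve in MATLAB, assuming the earlier hypothetical svmModel and validation table valTbl, and assuming 'Abnormal' is the positive-class label and the second score column corresponds to it:

% Sketch: ROC curve from SVM classification scores.
[~, scores] = predict(svmModel, valTbl);             % per-class scores
[fpr, tpr, ~, auc] = perfcurve(valTbl.Class, ...
    scores(:, 2), 'Abnormal');                       % positive class
plot(fpr, tpr), grid on
xlabel('False positive rate'), ylabel('True positive rate')
title(sprintf('ROC curve (AUC = %.2f)', auc))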

5.5 Summary

In this chapter, the Support Vector Machine (SVM) algorithm used to classify the audio data is described; the basic SVM can be used only for binary classification. A brief description of the signal analysis and the classification results is given. The signals of a normal heart beat and an abnormal heart beat are analyzed, and the time plots and power spectra of both are plotted. The scatter plot of the complete audio data is discussed, the confusion matrix for the SVM classifier is described, and the accuracy of the SVM classifier for binary classification is noted.

Chapter 6
Conclusions
In this report, various supervised machine learning algorithms are analyzed, namely the Support Vector Machine, K nearest neighbor, and Naïve Bayes classifier. The objective of this report is to implement the Support Vector Machine (SVM), K nearest neighbor (KNN), and Naïve Bayes (NB) classifiers to find the supervised learning algorithm best suited for classification. The Support Vector Machine (SVM) approach proposed in this report utilizes the extracted features of the audio data, namely the mean, median, standard deviation, dominant frequency, spectral entropy, and thirteen Mel Frequency Cepstral Coefficients (MFCCs), so a total of 18 features are extracted for the classification task. The normal and abnormal heart beats are analyzed, and it is noticed that the abnormal heart beat has higher frequencies than the normal heart beat. The time plots and power spectra of the normal and abnormal heart beats are examined. 5-fold cross validation is used for classification. It is found that the training time for SVM is 147.32 secs, the prediction speed is 5900 obs/sec, and the classification accuracy obtained with the SVM classifier is 87.1%.

6.1 Future Scope

• The data classification system can be further improved for use in specialized areas such as sensor networks and intrusion detection.
• In the future, hybrid networks can be designed and implemented to classify huge datasets, achieving improved performance.
• The SVM classifier can be extended to multi-class classification by cascading two or more SVM classifiers.
• The performance and accuracy of classification can be further improved using neural networks, where the classifier does not need to be trained on manually engineered features.

References
[1] E. Loper, E. Klein, and S. Bird, “Preprocessing Raw Text” in Natural Language Processing with Python, 1st ed., O’Reilly Media, Inc., Sebastopol, CA, July 2009, ch. 3, pp. 79-123.

[2] D. Greene and P. Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML, 2006.

[3] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, vol. 34, no. 1, March 2002, pp. 1-47.

[4] J. Grimmer and B.M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”, Political Analysis, January 2013, pp. 1-31.

[5] D.D. Lewis and M. Ringuette, “A Comparison of Two Learning Algorithms for Text Categorization”, in Proc. of the Third Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 81-93.

[6] G.D. Guo, H. Wang, D. Bell, Y.X. Bi, and K. Greer, “Using KNN model for automatic text categorization”, Soft Computing, 10(5), pp. 423-430, 2006.

[7] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, vol. 20, pp. 273-297, 1995.

[8] A.S. Patil and B.V. Pawar, “Automated Classification of Websites using Naïve Bayesian Algorithm”, Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, 2012, vol. 1, pp. 14-16.

[9] L. Jiang, C. Li, S. Wang, and L. Zhang, “Deep feature weighting for Naïve Bayes and its application to text classification”, Engineering Applications of Artificial Intelligence, vol. 52, June 2016, pp. 26-39.

[10] Q. Yuan, G. Cong, and N.M. Thalmann, “Enhancing Naïve Bayes with Various Smoothing Methods for Short Text Classification”, WWW 2012 Companion, April 16-20, 2012, Lyon, France, ACM 978-1-4503-1230-1/12/04.

[11] V. Lertnattee and T. Theeramunkongt, “Analysis of Inverse Class Frequency in Centroid-based Text Classification”, International Symposium on Communication and Information Technologies 2004 (ISCIT 2004), pp. 1171-1176, Sapporo, Japan, October 26-29, 2004.

[12] D.M.W. Powers, “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation”, School of Informatics and Engineering, Flinders University of South Australia, Technical Report SIE-07-001, December 2007.

