ABSTRACT:
1.1 Background
The response of an individual neuron to stimuli within its receptive field can be
approximated mathematically by a convolution operation.[3] Convolutional
networks were inspired by biological processes[4] and are variations of
multilayer perceptrons designed to require minimal preprocessing.[5]
They have wide applications in image and video recognition, recommender
systems, and natural language processing.
There are four main operations in the Convolutional Neural Network shown in
Figure 2.2 above:
1. Convolution:
During the forward pass, each filter is convolved across the width and height
of the input volume, computing dot products between the entries of the filter
and the input at every position. As the filter convolves over the width and height of
the input volume, it produces a 2-dimensional activation map that gives the
responses of that filter at every spatial position. Intuitively, the network will
learn filters that activate when they see some type of visual feature, such as an
edge of some orientation or a blotch of some color on the first layer, or
eventually entire honeycomb or wheel-like patterns on higher layers of the
network.
Now, there will be an entire set of filters in each convolution layer (e.g. 20
filters), and each of them will produce a separate 2-dimensional activation map.
The 2-dimensional convolution between image A and filter B can be given as:

C[i, j] = sum over m, sum over n of A[i + m, j + n] * B[m, n]

where m and n run over the rows and columns of the filter.
A filter convolves with the input image to produce a feature map. The
convolution of another filter over the same image gives a different feature map.
The convolution operation captures the local dependencies in the original image. A
CNN learns the values of these filters on its own during the training process
(although parameters such as the number of filters, the filter size, and the
architecture of the network still need to be specified before training). The more
filters there are, the more image features get extracted and the better the network
becomes at recognizing patterns in unseen images.
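The convolution operation described above can be sketched in NumPy. This is an illustrative "valid" convolution with stride 1, not the project's actual code; strictly speaking, CNNs compute cross-correlation (the kernel is not flipped), which is what the dot-product description in the text corresponds to:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution as used in CNNs: slide the kernel over the
    image and take the dot product at every position (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # output size = input size - filter size + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 3x3 vertical-edge filter applied to a 5x5 image yields a 3x3 feature map.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
feature_map = convolve2d(image, kernel)
print(feature_map.shape)   # (3, 3)
```

Each filter learned by the network plays the role of `kernel` here, and each of the 20 filters in a layer produces its own feature map.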
2. ReLU:
An additional operation called ReLU is used after every convolution
operation. A Rectified Linear Unit (ReLU) is a cell of a neural network which uses
the following activation function to calculate its output given x:

f(x) = max(0, x)
Using these cells is more efficient than sigmoid units and still forwards more
information compared to binary units. When the weights are initialized uniformly
around zero, about half of the pre-activations are negative and thus mapped to zero,
which helps create a sparse feature representation.
Another positive aspect is the relatively cheap computation: no exponential
function has to be calculated. This function also mitigates the vanishing gradient
problem, since the gradient is either constant (one) or zero and therefore never
saturates.
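The ReLU function and its gradient can be sketched in a few lines of NumPy (illustrative only):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """The gradient is 1 for positive inputs and 0 otherwise --
    cheap to compute, with no exponential involved."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # zero for non-positive inputs, identity for positive
print(relu_grad(x))  # 0 where the unit is inactive, 1 where it is active
```

Negative activations are zeroed out, which is the sparsity effect mentioned above.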
3. Pooling (sub-sampling):
Max pooling reduces the input by applying the maximum function over the inputs
x_i. Let m be the size of the filter; then the output is calculated as follows:

y = max(x_1, ..., x_(m*m)), the maximum over each m x m window.

The function of pooling is to progressively reduce the spatial size of the input
representation. In particular, pooling makes the feature dimensions smaller and
more manageable and reduces the number of parameters and computations in the
network.
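Non-overlapping max pooling can be sketched as follows (an illustrative NumPy version, not the project's code):

```python
import numpy as np

def max_pool(x, m=2):
    """Non-overlapping m x m max pooling: each output cell is the
    maximum over one m x m window of the input."""
    h, w = x.shape
    oh, ow = h // m, w // m
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*m:(i+1)*m, j*m:(j+1)*m].max()
    return out

x = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 1.0, 2.0],
              [7.0, 2.0, 9.0, 0.0],
              [4.0, 8.0, 3.0, 1.0]])
print(max_pool(x, 2))   # a 4x4 input shrinks to a 2x2 output
```

With m = 2, each spatial dimension is halved, which matches the 44 -> 22 and 18 -> 9 reductions described later in the architecture.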
4. Fully Connected Layer:
The purpose of the Fully Connected layer is to use these features for classifying the
input image into various classes based on the training dataset. Softmax is used as the
activation function. It treats the outputs as scores for each class: the function
mapping stays unchanged, and these scores are interpreted as the un-normalized
log probabilities for each class. Softmax is calculated as:

P(y = j) = e^(z_j) / (sum from k = 1 to K of e^(z_k))

where j is the index of the class and K is the total number of facial expression classes.
The sum of the output probabilities from the Fully Connected Layer is 1. This is
ensured by using Softmax as the activation function in the output layer of the Fully
Connected Layer. The Softmax function takes a vector of arbitrary real-valued
scores and squashes it to a vector of values between zero and one that sum to one.
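The squashing behaviour can be sketched in NumPy (illustrative; subtracting the maximum score first is a standard numerical-stability trick, not something stated in this report):

```python
import numpy as np

def softmax(scores):
    """Squash a vector of arbitrary real scores into probabilities
    in (0, 1) that sum to one."""
    z = scores - np.max(scores)   # stability: shift so the largest score is 0
    e = np.exp(z)
    return e / e.sum()

# Seven un-normalized class scores -> seven probabilities summing to 1.
scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 3.0, -0.5])
p = softmax(scores)
print(p.sum())          # 1.0
print(int(p.argmax()))  # the class with the largest score gets the highest probability
```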
LITERATURE REVIEW
AUTHOR: C. Zor.
DESCRIPTION: Human facial expressions convey a lot of information
visually rather than articulately. Facial expression recognition plays a crucial role
in the area of human-machine interaction. An automatic facial expression recognition
system has many applications including, but not limited to, human behavior
understanding, detection of mental disorders, and synthetic human expressions.
Recognition of facial expressions by computer with a high recognition rate is still a
challenging task. The two popular methods utilized most in the literature for
automatic FER systems are based on geometry and appearance. Facial expression
recognition is usually performed in four stages consisting of pre-processing, face
detection, feature extraction, and expression classification. In this project we
applied deep learning methods (convolutional neural networks) to identify
the key seven human emotions: anger, disgust, fear, happiness, sadness, surprise,
and neutrality.
SURVEY OF PAPER:
Two different approaches to facial expression recognition, each of which
includes two different methodologies, exist [6]. Dividing the face into separate
action units or keeping it as a whole for further processing is the first
and primary distinction between the two approaches. In both of these
approaches, two different methodologies, namely the 'Geometric-based' and the
'Appearance-based' parameterizations, can be used.
The first approach makes use of the whole frontal face image and processes it
in order to end up with classifications into the six universal facial expression
prototypes: disgust, fear, joy, surprise, sadness, and anger. Here, it is assumed
that each of the above-mentioned emotions has a characteristic expression on the
face, and that is why recognizing them is necessary and sufficient. Instead of
using the face image as a whole, dividing it into sub-sections for further
processing forms the main idea of the second approach to facial expression
analysis.
The disadvantages of this method are: the contours of these features and
components have to be adjusted manually in this framework; problems of robustness
and other difficulties arise in cases of pose and illumination changes while
tracking is applied to images; and, since actions and expressions tend to change
both in morphological and in dynamical senses, it becomes hard to estimate general
parameters for movement and displacement. Therefore, reaching robust
decisions about facial actions under these varying conditions becomes difficult.
Rather than tracking spatial points and using positioning and movement parameters
that vary over time, Appearance-based Parameterizations process the color (pixel)
information of the related regions of the face in order to obtain the
parameters that form the feature vectors. Different features, such as
Gabor and Haar wavelet coefficients, together with feature extraction and selection
methods such as PCA, LDA, and AdaBoost, are used within this framework.
During training, the system receives training data comprising grayscale images
of faces with their respective expression labels and learns a set of weights for the
network. The training step takes as input an image with a face. Thereafter,
intensity normalization is applied to the image. The normalized images are used to
train the Convolutional Network. To ensure that training performance is not
affected by the order of presentation of the examples, a validation dataset is used to
choose the final best set of weights out of a set of trainings performed with samples
presented in different orders.
The output of the training step is the set of weights that achieves the best result on
the training data. During testing, the system receives a grayscale face image from the
test dataset and outputs the predicted expression using the final network weights
learned during training. Its output is a single number that represents one of the
seven basic expressions.
3.1 Dataset
The dataset has a class imbalance issue, since some classes have a large number of
examples while some have few. The dataset is balanced using oversampling, by
increasing the number of examples in the minority classes. The balanced dataset
contains 40263 images, of which 29263 are used for training, 6000 for
testing, and 5000 for validation.
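Balancing by random oversampling can be sketched as follows. This is a generic illustration (the class names and counts here are made up), not the exact procedure used to build the dataset above:

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Random oversampling: duplicate randomly chosen minority-class
    examples until every class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(v) for v in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

samples = list(range(10))
labels = ['happy'] * 7 + ['disgust'] * 3   # an imbalanced toy dataset
_, balanced = oversample(samples, labels)
print(Counter(balanced))                   # both classes end up with 7 examples
```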
3.2 Architecture of CNN
The input layer has pre-determined, fixed dimensions, so the image must be
pre-processed before it can be fed into the layer. Normalized grayscale images
of size 48 x 48 pixels from the Kaggle dataset are used for training, validation, and
testing. For testing purposes, laptop webcam images are also used, in which the face
is detected and cropped using the OpenCV Haar Cascade Classifier and then
normalized.
Convolution and pooling are done using batch processing. Each batch has N
images, and the CNN filter weights are updated on those batches. Each convolution
layer takes an image batch input of four dimensions: N x color-channels x width x
height. The feature maps (filters) for convolution are also four-dimensional (number
of feature maps in, number of feature maps out, filter width, filter height). In
each convolution layer, a four-dimensional convolution is calculated between the
image batch and the feature maps. After convolution, the only parameters that
change are the image width and height:
new image width = old image width - filter width + 1
new image height = old image height - filter height + 1
Two convolution layers and two pooling layers are used in the architecture. At the
first convolution layer, the size of the input image batch is Nx1x48x48. Here, the
batch size is N, the number of color channels is 1, and both image height and width
are 48 pixels. Convolution with a feature map of 1x20x5x5 results in an image batch
of size Nx20x44x44. After convolution, pooling is done with a pool size of 2x2,
which results in an image batch of size Nx20x22x22. This is followed by a second
convolution layer with a feature map of 20x20x5x5, which results in an image
batch of size Nx20x18x18. This is followed by a pooling layer with pool size
2x2, which results in an image batch of size Nx20x9x9.
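The size bookkeeping above can be checked with the width formula from the previous paragraph (a small sketch, not part of the project's code):

```python
def conv_out(size, filt):
    # new width = old width - filter width + 1 (valid convolution, stride 1)
    return size - filt + 1

def pool_out(size, pool):
    # non-overlapping pooling halves each dimension when pool = 2
    return size // pool

w = 48                # N x 1 x 48 x 48 input batch
w = conv_out(w, 5)    # conv1, 5x5 filters -> N x 20 x 44 x 44
w = pool_out(w, 2)    # 2x2 pooling       -> N x 20 x 22 x 22
w = conv_out(w, 5)    # conv2, 5x5 filters -> N x 20 x 18 x 18
w = pool_out(w, 2)    # 2x2 pooling       -> N x 20 x 9 x 9
print(w)              # 9
print(20 * w * w)     # 1620 features per image enter the fully-connected layers
```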
3. Fully Connected Layer
This layer is inspired by the way neurons transmit signals through the brain. It
takes a large number of input features and transforms them through layers
connected with trainable weights. Two hidden layers of size 500 and 300 units
are used in the fully-connected layer. The weights of these layers are trained by
forward propagation of training data followed by backward propagation of the errors.
Back-propagation starts from evaluating the difference between the prediction
and the true value, and back-calculates the weight adjustment needed for every
preceding layer. We can control the training speed and the complexity of the
architecture by tuning hyper-parameters such as the learning rate and network
density. Hyper-parameters for this layer include the learning rate, momentum,
regularization parameter, and decay.
The output from the second pooling layer is of size Nx20x9x9, while the input to the
first hidden layer of the fully-connected layer is of size Nx500. So, the output of the
pooling layer is flattened to size Nx1620 and fed to the first hidden layer. The output
from the first hidden layer is fed to the second hidden layer, which is of size Nx300,
and its output is fed to the output layer, whose size equals the number of facial
expression classes.
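The flattening step and the fully-connected sizes can be sketched in NumPy. The weights here are random placeholders for illustration only; the real weights are learned by back-propagation as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                          # batch size (illustrative)

pooled = rng.standard_normal((N, 20, 9, 9))    # output of the second pooling layer
flat = pooled.reshape(N, -1)                   # flatten to N x 1620

W1 = rng.standard_normal((1620, 500)) * 0.01   # first hidden layer (500 units)
W2 = rng.standard_normal((500, 300)) * 0.01    # second hidden layer (300 units)
W3 = rng.standard_normal((300, 7)) * 0.01      # output layer: 7 expression classes

h1 = np.maximum(0.0, flat @ W1)                # ReLU activations
h2 = np.maximum(0.0, h1 @ W2)
scores = h2 @ W3                               # un-normalized class scores
print(flat.shape, scores.shape)                # (4, 1620) (4, 7)
```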
4. Output Layer
The output from the second hidden layer is connected to the output layer, which has
seven distinct classes. Using the Softmax activation function, the output is obtained
as probabilities for each of the seven classes. The class with the highest probability
is the predicted class.
SYSTEM REQUIREMENT:
HARDWARE REQUIREMENTS:
Processor : DUAL CORE 2.5GHZ
Ram : 1 GB SD RAM
Monitor : 15” COLOR
Hard Disk : 80 GB
Keyboard : STANDARD 102 KEYS
Mouse : 3 BUTTONS
SOFTWARE CONFIGURATION:
ABOUT MATLAB
MATLAB TOOLBOXES
There are several optional 'toolboxes' available from the developers of
MATLAB. These toolboxes are collections of functions written for special
applications, such as the Symbolic Computation Toolbox, Image Processing Toolbox,
Statistics Toolbox, Neural Networks Toolbox, Communications Toolbox,
Signal Processing Toolbox, Filter Design Toolbox, Fuzzy Logic Toolbox, Wavelet
Toolbox, Database Toolbox, Control System Toolbox, Bioinformatics Toolbox,
and Mapping Toolbox.
BASICS OF MATLAB
MATLAB WINDOWS
On all UNIX systems, Macs, and PCs, MATLAB works through three basic
windows. They are:
a. Command window:
This is the main window, where the MATLAB prompt appears and where all
commands are typed.
b. Graphics (figure) window:
The output of all graphics commands typed in the command window is
flushed to the graphics or figure window, a separate gray window with a (default)
white background color. The user can create as many figure windows as the
system memory will allow.
c. Edit window:
This is where you write, edit, create, and save your own programs in files
called M-files. You can use any text editor to carry out these tasks. On most
systems, such as PCs and Macs, MATLAB provides its own built-in editor. On other
systems, you can invoke the edit window by typing the standard file-editing
command that you normally use on your system. The command is typed at the
MATLAB prompt following the special character '!'. After editing is
completed, control is returned to MATLAB.
On-Line Help
a. On-line documentation:
MATLAB provides on-line help for all its built-in functions and
programming language constructs. The commands lookfor, help, helpwin, and
helpdesk provide on-line help.
b. Demo:
MATLAB has a demonstration program that shows many of its features. The
program includes a tutorial introduction that is worth trying. Type demo at the
MATLAB prompt to invoke the demonstration program, and follow the instructions
on the screen.
Input-Output
MATLAB supports interactive computation, taking input from the keyboard
and flushing output to the screen. In addition, it can read input files and write
output files. The following features hold for all forms of input-output.
a. Data type
b. Dimensioning
c. Case sensitivity
MATLAB is case sensitive, i.e. it differentiates between lowercase and
uppercase letters. Thus a and A are different variables. Most MATLAB commands
and built-in function calls are typed in lowercase letters. Case sensitivity can be
turned on and off with the casesen command.
d. Output display
i. Paged output
To direct MATLAB to show one screen of output at a time, type more on at
the MATLAB prompt. Without it, MATLAB flushes the entire output at once,
without regard to the speed at which we read.
ii. Output format
Though computations inside MATLAB are performed using double
precision, the appearance of floating-point numbers on the screen is controlled by
the output format in use. There are several different screen output formats. The
following table shows the printed value of 10*pi in different formats.
e. Command History
File Types
M-files: M-files are standard ASCII text files, with a .m extension to the file
name. There are two types of these files: script files and function files. Most
programs we write in MATLAB are saved as M-files. Many built-in functions in
MATLAB are M-files, most of which reside on the computer in precompiled
format. Some built-in functions are provided with source code in readable M-files
so that they can be copied and modified.
Mat-files: Mat-files are binary data-files with a .mat extension to the file name.
Mat-files are created by MATLAB when we save data with the save command.
The data is written in a special format that only MATLAB can read. Mat-files can
be loaded into MATLAB with the load command.
Platform independence
Images in MATLAB
(i) The color map is a matrix of values representing all the colors in the image.
(ii) The image matrix contains indexes corresponding to the color map.
A color map matrix is of size N x 3, where N is the number of different colors in the
image. Each row represents the red, green, and blue components of one color.
For example, the matrix

    r1 g1 b1
    r2 g2 b2

represents two colors, the first having components r1, g1, b1 and the second having
components r2, g2, b2.
The Wavelet Toolbox only supports indexed images that have linear,
monotonic color maps. Often color images need to be pre-processed into grayscale
images before using wavelet decomposition. The Wavelet Toolbox User's
Guide provides some sample code to convert color images into grayscale. This
will be useful if you need to put any images into MATLAB.
MATLAB
Matlab is a high-performance language for technical computing.
It integrates computation, programming, and visualization in a user-friendly
environment where problems and solutions are expressed in an easy-to-understand
mathematical notation.
Matlab is an interactive system whose basic data element is an array that does not
require dimensioning. This allows the user to solve many technical computing
problems, especially those with matrix and vector operations, in less time than it
would take to write a program in a scalar non-interactive language such as C or
FORTRAN.
Matlab features a family of application-specific solutions called toolboxes. It is
very important to most users of Matlab that toolboxes allow one to learn and apply
specialized technology. These toolboxes are comprehensive collections of Matlab
functions, so-called M-files, that extend the Matlab environment to solve particular
classes of problems.
Matlab is a matrix-based programming tool. Although matrices often need not be
dimensioned explicitly, the user always has to look carefully at matrix dimensions.
If not defined otherwise, the standard matrix exhibits two dimensions, n x m.
Column vectors and row vectors are represented consistently by n x 1 and 1 x n
matrices, respectively.
MATLAB OPERATIONS
Variables
Numbers
Operators
Functions
VARIABLES
NUMBERS
Matlab uses the conventional decimal notation.
A decimal point and a leading plus or minus sign are optional. Scientific
notation uses the letter e to specify a power-of-ten scale factor.
Imaginary numbers use either i or j as a suffix.
Some examples of legal numbers are:
7 -55 0.0041 9.657838 6.10220e-10 7.03352e21 2i -2.71828j 2e3i 2.5+1.7j.
OPERATORS
+ Addition
- Subtraction
* Multiplication
/ Division
' Complex conjugate transpose
( ) Parentheses to specify the evaluation order
FUNCTIONS
Matlab provides a large number of standard elementary mathematical
functions, including sin, sqrt, exp, and abs.
Taking the square root or logarithm of a negative number does not lead to an
error; the appropriate complex result is produced automatically.
Matlab also provides a lot of advanced mathematical functions, including
Bessel and Gamma functions. Most of these functions accept complex
arguments.
For a list of the elementary mathematical functions, type
>> help elfun
Some of the functions, like sqrt and sin are built-in. They are a fixed part of
the Matlab core so they are very efficient.
The drawback is that the computational details are not readily accessible.
Other functions, like gamma and sinh, are implemented in so-called M-files.
You can see the code and even modify it if you want.
INTRODUCTION
When working with images in Matlab, there are many things to keep in mind
such as loading an image, using the right format, saving the data as different data
types, how to display an image, conversion between different image formats, etc.
This worksheet presents some of the commands designed for these operations.
Most of these commands require you to have the Image Processing Toolbox
installed with Matlab. To find out if it is installed, type ver at the Matlab prompt.
This gives you a list of the toolboxes that are installed on your system.
For further reference on image handling in Matlab you are recommended to
use Matlab's help browser. There is an extensive (and quite good) on-line manual
for the Image Processing Toolbox that you can access via Matlab's help browser.
The first sections of this worksheet are quite heavy. The only way to
understand how the presented commands work is to carefully work through the
examples given at the end of the worksheet. Once you can get these examples to
work, experiment on your own using your favorite image!
FUNDAMENTALS
A digital image is composed of pixels which can be thought of as small dots
on the screen. A digital image is an instruction of how to color each pixel. We will
see in detail later on how this is done in practice. A typical size of an image is 512-
by-512 pixels. Later on in the course you will see that it is convenient to let the
dimensions of the image be a power of 2. For example, 2^9 = 512. In the general
case we say that an image is of size m-by-n if it is composed of m pixels in the
vertical direction and n pixels in the horizontal direction.
Let us say that we have an image on the format 512-by-1024 pixels. This
means that the data for the image must contain information about 524288 pixels,
which requires a lot of memory! Hence, compressing images is essential for
efficient image processing. You will later on see how Fourier analysis and Wavelet
analysis can help us to compress an image significantly. There are also a few
"computer scientific" tricks (for example entropy coding) to reduce the amount of
data required to store an image.
Most images you find on the Internet are JPEG-images which is the name for
one of the most widely used compression standards for images. If you have stored
an image you can usually see from the suffix what format it is stored in. For
example, an image named myimage.jpg is stored in the JPEG format and we will
see later on that we can load an image of this format into Matlab.
BINARY IMAGE
This image format also stores an image as a matrix but can only color a pixel
black or white (and nothing in between). It assigns a 0 for black and a 1 for white.
INDEXED IMAGE
This is a practical way of representing color images. (In this course we will
mostly work with gray scale images but once you have learned how to work with a
gray scale image you will also know the principle how to work with color images.)
An indexed image stores an image as two matrices. The first matrix has the same
size as the image and one number for each pixel. The second matrix is called the
color map and its size may be different from the image. The numbers in the first
matrix is an instruction of what number to use in the color map matrix.
RGB IMAGE
This is another format for color images. It represents an image with three
matrices of sizes matching the image format. Each matrix corresponds to one of
the colors red, green or blue and gives an instruction of how much of each of these
colors a certain pixel should use.
MULTIFRAME IMAGE
In some applications we want to study a sequence of images. This is very
common in biological and medical imaging where you might study a sequence of
slices of a cell. For these cases, the multiframe format is a convenient way of
working with a sequence of images. In case you choose to work with biological
imaging later on in this course, you may use this format.
Operation: Convert an image to double precision. (Within the parentheses you type
the name of the image you wish to convert.)
Matlab command: I = im2double(I);

Operation: Convert an image to 8-bit unsigned integers.
Matlab command: I = im2uint8(I);

Operation: Read an image. (Within the parentheses you type the name of the image
file you wish to read; put the file name within single quotes ' '.)
Matlab command: I = imread('filename');
Make sure to use a semicolon ; after these commands, otherwise you will get
lots of numbers scrolling on your screen. The commands imread and imwrite
support the formats given in the section "Image formats supported by Matlab"
above.
The training image batch size was taken as 30, while the filter map is of size
20x5x5 for both convolution layers. A validation set was used to validate the
training process. In the last batch of every epoch, the validation cost, validation
error, training cost, and training error are calculated. The input parameters for
training are the image set and the corresponding output labels. The training process
updates the weights of the feature maps and hidden layers based on
hyper-parameters such as learning rate, momentum, regularization, and decay. In
this system, a batch-wise learning rate of 10e-5 was used, with momentum 0.99,
regularization 10e-7, and decay 0.99999.
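The report does not show the exact update rule, but a common SGD-with-momentum form consistent with the listed hyper-parameters can be sketched as follows (the constants below are illustrative; the report writes the learning rate as "10e-5"):

```python
def sgd_momentum_step(w, v, grad, lr=1e-5, momentum=0.99, reg=1e-7):
    """One SGD update with momentum and L2 regularization:
    v <- momentum * v - lr * (grad + reg * w);  w <- w + v."""
    v = momentum * v - lr * (grad + reg * w)
    return w + v, v

# Toy loss L(w) = w^2, so grad = 2w; the weight should move toward 0.
w, v = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w
    w, v = sgd_momentum_step(w, v, grad)
print(w)   # slightly below 1.0 after three small steps
```

The decay hyper-parameter would additionally shrink the learning rate after each batch (e.g. lr = lr * 0.99999), which is omitted here for brevity.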
The comparison of validation cost, validation error, training cost, and training error
is shown in the figures below.
The graph shows that in 50 epochs the training cost and error rate are reduced to
0.35 and 0.1 respectively, while the validation cost and error are 1.25 and 0.45
respectively.
The testing of the model is carried out using 6000 images. The classifier
provided 56.77% accuracy. The confusion matrix for the seven facial expression
classes is shown below:
The overall precision and recall are both 0.57. The model
performs well on classifying positive emotions, resulting in relatively high
precision scores for happy and surprised. Disgust has the highest precision and
recall, 0.95 and 0.99, as images in this class were oversampled to address class
imbalance. Happy has a precision of 0.68 and a recall of 0.69, which could be
explained by it having the most examples (6500) in the training set.
Interestingly, surprise has a precision of 0.69 and a recall of 0.65 despite having
the fewest examples in the training set. There must be very strong signals in the
surprise expressions.
Happy and surprise have the higher F1-scores, 0.69 and 0.67 respectively. Fear
has the lowest F1-score, 0.39, and sad, anger, and neutral also have low F1-scores.
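The quoted F1-scores can be approximately reproduced from the stated precision and recall values as their harmonic mean (small discrepancies arise because the report rounds precision and recall to two decimals):

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.69, 0.65), 2))   # surprise -> 0.67, matching the text
print(round(f1(0.95, 0.99), 2))   # disgust  -> 0.97
print(round(f1(0.68, 0.69), 2))   # happy    -> 0.68 (reported as 0.69)
```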
The CNN classifier is then used to classify images taken from a laptop webcam.
The face is detected in webcam frames using the Haar cascade classifier from
OpenCV. The detected face is then cropped, normalized, and fed to the CNN
classifier. Some classification results using the webcam are listed in Appendix A.
5.1 CONCLUSION
In this project, a LeNet-architecture-based six-layer convolutional neural network
is implemented to classify human facial expressions, namely happy, sad, surprise,
fear, anger, disgust, and neutral. The system has been evaluated using
accuracy, precision, recall, and F1-score. The classifier achieved an accuracy of
56.77%, a precision of 0.57, a recall of 0.57, and an F1-score of 0.57.
REFERENCES:
[1] Shan, C., Gong, S., & McOwan, P. W. (2005). Robust facial expression
recognition using local binary patterns. In IEEE International Conference on
Image Processing (ICIP 2005), Vol. 2, pp. II-370. IEEE.
[4] Matsugu, M., Mori, K., Mitari, Y., & Kaneda, Y. (2003). Subject independent
facial expression recognition with robust face detection using a convolutional
neural network. Neural Networks, 16(5), 555-559.
doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.
[5] LeCun, Y. "LeNet-5, convolutional neural networks". Retrieved 16
November 2013.
[8] Tian, Y. I., Kanade, T., & Cohn, J. F. (2001). Recognizing action units for
facial expression analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(2), 97-115.
[10] Alizadeh, S., & Fazel, A. (2016). Convolutional Neural Networks for
Facial Expression Recognition. Stanford University.