Contents :
• Introduction
Bhujbal Knowledge City
MET Institute of Engineering
INTRODUCTION
The task of image captioning can be divided logically into two modules. The first is an
image-based model, which extracts the features and nuances of the image. The second is a
language-based model, which translates the features and objects produced by the
image-based model into a natural-language sentence.
For the image-based model (the encoder) we usually rely on a Convolutional Neural
Network, and for the language-based model (the decoder) we rely on a
Recurrent Neural Network.
CNN
A convolutional neural network (CNN) is a specific type of artificial neural
network, built from perceptron-like units (neurons) and typically trained with
supervised learning, that is used to analyze data.
CNNs are applied to image processing, natural language processing and other
kinds of cognitive tasks.
CNNs can encode abstract features from images. These can then be used
for classification, object detection, segmentation, captioning and various
other tasks.
CNN
Convolutional networks are trainable multistage architectures with each
stage consisting of multiple layers.
The input and output of each stage are sets of arrays called feature
maps.
For a color image, each input feature map is a 2D array
containing one color channel of the image. For a video each feature map would be a
3D array, and for an audio input a 1D array.
E.g., a 6 x 6 RGB image is a 6 x 6 x 3 array: three 6 x 6 feature maps, one per color channel.
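To make the array layout concrete, here is a minimal sketch in plain Python (no libraries). The pixel values are made up for illustration, and the helper name split_channels is hypothetical; it only shows how a 6 x 6 x 3 array separates into three 2D feature maps.

```python
# A hypothetical 6 x 6 RGB image stored as nested lists: image[row][col][channel].
# The pixel values are arbitrary and only serve the illustration.
HEIGHT, WIDTH, CHANNELS = 6, 6, 3

image = [
    [[(r * WIDTH + c + ch) % 256 for ch in range(CHANNELS)] for c in range(WIDTH)]
    for r in range(HEIGHT)
]


def split_channels(img):
    """Return one 2D feature map (a list of rows) per color channel."""
    depth = len(img[0][0])
    return [[[pixel[ch] for pixel in row] for row in img] for ch in range(depth)]


feature_maps = split_channels(image)

# Three feature maps (R, G, B), each 6 rows by 6 columns.
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
```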
Convolution Layer
The convolution layer is the first layer to extract features from an input image.
This layer is the core building block of a CNN. The layer's parameters consist
of learnable kernels, or filters, which extend through the full depth of the input.
Convolution is a mathematical operation that takes two inputs: an image matrix
and a filter (kernel).
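The operation can be sketched as a sliding dot product between the kernel and each patch of the image. The function below is an illustrative sketch, not a reference implementation: it assumes no padding and a stride of 1, and like most CNN libraries it does not flip the kernel (strictly speaking this is cross-correlation). The name convolve and the example values are hypothetical.

```python
def convolve(image, kernel):
    """Slide a kernel over an image (no padding, stride 1).

    image:  H x W x C nested lists
    kernel: kH x kW x C nested lists, with the same depth C as the image
    returns a 2D feature map of size (H - kH + 1) x (W - kW + 1)
    """
    h, w, depth = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            total = 0
            # Dot product over the kernel window and the full input depth.
            for di in range(kh):
                for dj in range(kw):
                    for ch in range(depth):
                        total += image[i + di][j + dj][ch] * kernel[di][dj][ch]
            row.append(total)
        out.append(row)
    return out


# A 4 x 4 single-channel image: dark left half, bright right half.
image = [[[0], [0], [1], [1]] for _ in range(4)]
# A 2 x 2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[[-1], [1]], [[-1], [1]]]

feature_map = convolve(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is largest in the middle column, which is where the edge sits in the input.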
===
Non-linearity Layer
This is a layer of neurons that applies an activation function to its inputs.
The activation functions are typically sigmoid, tanh and ReLU.
ReLU stands for rectified linear unit, and is a type of activation function.
Mathematically, it is defined as y = max(0, x).
These functions help the network make sense of, and extract knowledge from,
large and complicated datasets.
They make the network more powerful:
they add the ability to learn complex patterns from data
and to represent arbitrary non-linear functional mappings between
inputs and outputs.
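The three activation functions named above can be written in a few lines using only the standard math module. The sample feature map is made up for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: y = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes x into the range (-1, 1)."""
    return math.tanh(x)


# Applying ReLU element-wise to a small, made-up feature map:
feature_map = [[-2.0, 0.5], [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in feature_map]
print(activated)  # [[0.0, 0.5], [3.0, 0.0]]
```

Negative values are set to zero and positive values pass through unchanged, which is the whole of ReLU's non-linearity.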
Pooling Layer
Pooling layers shrink the feature maps, which reduces the amount of computation and the
number of parameters needed in later layers when the images are large.
Pooling is done for the purpose of reducing the spatial size of the
image representation.
Spatial pooling, also called subsampling or downsampling, reduces
the dimensionality of each map but retains the important information.
Spatial pooling can be of different types:
•Max Pooling
•Average Pooling
•Sum Pooling
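The three pooling types differ only in how each window is reduced to a single value. The sketch below assumes non-overlapping square windows (window size equal to stride), which is a common choice but not the only one; the function name pool2d and the sample values are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2D feature map with non-overlapping size x size windows."""
    reducers = {
        "max": max,
        "average": lambda window: sum(window) / len(window),
        "sum": sum,
    }
    reduce_window = reducers[mode]
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            # Collect the values inside the current window.
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce_window(window))
        out.append(row)
    return out


fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

print(pool2d(fm, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(fm, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
print(pool2d(fm, mode="sum"))      # [[15, 9], [18, 16]]
```

In each case the 4 x 4 map becomes a 2 x 2 map, so the spatial size drops by a factor of four.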
Applications of CNN:
1) Speech recognition: Convolutional Neural Networks have recently been used in
speech recognition and have given better results than Deep Neural Networks (DNNs).
2) They are also used in object tracking and video classification.
3) They help in iterative image reconstruction and super-resolution of low-resolution images.
4) They support edge detection and semantic segmentation.
Thank You …