Sie sind auf Seite 1von 8

AN END-TO-END TRAINING BASED APPROACH FOR CAPTIONING

COMPLEX IMAGES USING NEURAL NETWORKS

By

Shreyas V Kashyap - 1BG13CS097


Swaroop Gupta B.A -1BG13CS113
Spandana Rao B.A - 1BG13CS103

Under the Guidance of


Smt.Jebah Jaykumar
AREA OF RESEARCH

We have hereby chose the area of research to be a combination of the following


research extensive areas.

Machine Learning

Image Processing
CORE AREA OF PROBLEM

Our core area of problem is

Computer Vision

Program a computer to "understand" a scene or features in an image.

Concerned with the automatic extraction, analysis and understanding of useful


information from a single image or a sequence of images
RELEVANCE OF THE PROBLEM

Object detection

Currently the existing state of the art methods can detect a single object or
multiple non overlapping objects

This makes the detection useless for any analysis of an entire scene

Object labelling

The labelling of single prominent object in an image

Unable to describe a scene in a complex dense detailed image


APPLICATION OF THE PROBLEM

Currently the problem persists in the following applications of Computer Vision

Image search

The existing image search works only on the file name of an image, and not on the
details of the scene
This should be overcome and the search should happen only on the basis of what is
there in the image, instead of the filename

Image scene detection

The existing Image detection, detects a single prominent part of the image and
cannot detect if there are variations of viewpoint.
This should be overcome and multiple objects need to be detected and the whole
scene has to be described.
PROBLEM STATEMENT

Our ability to effortlessly describe all aspects of an image relies on a strong semantic
understanding of a visual scene and all of its elements. However, despite numerous
potential applications, this ability remains a challenge for our state of the art visual
recognition systems

Our goal is to design an architecture that jointly localizes regions of interest and the
describes each with natural language

Architecture is composed of a Convolutional Network, an efficient dense localization


layer, and Recurrent Neural Network language model that generates the label
sequences for the complex images.
INPUT / OUTPUT EXAMPLE

Sample Input :

Output of existing System :

Output Expected :
THANK YOU

Das könnte Ihnen auch gefallen