
B.M.S College of Engineering
(Autonomous Institution under VTU)
Bangalore-560 019

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING

Course: Project Work - Phase 2

Course Code: 16IS8DCPRW

Project Phase - II: Review 1

Batch Number: 17

Project Title: Deep Learning Image Caption Generator

Project Guide: Nalina V

Team Members:

P Aishwarya Naidu (1BM16IS062)

Satvik Vats (1BM16IS079)

Gehna Anand (1BM16IS034)


INTRODUCTION

Digital image processing is a modern processing technique that uses the features of an
image to analyse its properties and to extract valuable information that can be used for a wide
range of scientific purposes. It supports many applications, such as vehicle detection from aerial
camera images, forensic screening and disease detection using machine learning. One such
application is in the keyboard industry, where poorly manufactured keyboards can be detected at
the manufacturing stage: an image of the manufactured keyboard is fed to the system to detect
missing or damaged keys. Digital image processing is used in almost every sector, from
manufacturing to services, and has proved to be a boon for workers across industries. It typically
follows a three-step procedure: importing the image to be analysed using an acquisition tool,
analysing and manipulating the image, and presenting the output, which can be an altered image
or a report based on the analysis. Viewed as a phase-based process, it involves the following
phases: acquisition, image enhancement, image restoration, color image processing, wavelets and
multi-resolution processing, image compression, morphological processing, segmentation,
representation and description, and object detection and recognition. Digital image processing is
used not only in manufacturing but also in law and order maintenance; governments and other
law enforcement agencies across the globe use it to provide smart surveillance, and it is even
employed in complex anti-terror operations, where it is used to match various traits of a
suspected person by extracting important features from the available samples.
Computer vision acts as a kind of automated watchdog, drawing on both science and
technology. As a scientific discipline, computer vision is concerned with the theory behind the
design of artificial systems that can acquire information from images. The image input may take
many forms, such as a video sequence, multiple views from different cameras, or data from a
medical scanning machine. Applications of computer vision include systems for controlling
processes such as industrial robots or autonomous vehicles; for detecting events, as in visual
surveillance or people counting; for organizing information, such as indexing databases of
images and image sequences; for modeling objects or environments, as in industrial inspection,
medical image analysis or topographical modeling; and for interaction, such as providing input
to a device for interaction between a computing machine and a human. Image processing and
computer vision differ in what they produce: in image processing, an image is taken as input and
a processed image is produced as output, whereas in computer vision an image is taken as input
and information about the image is produced as output. Image processing is essentially about
enhancing the image and manipulating features such as color, while computer vision is about
"image understanding" and can also employ machine learning. The two can be used in
combination to build a model that first processes an image into a suitable representation, such as
a feature vector, and then applies computer vision to the generated feature vectors to produce
output in the desired format.
Recurrent Neural Networks (RNNs) are essential for a model that must produce a natural
language description of more than one word, because they retain the memory of decisions made
in previous iterations while processing the image. A Long Short-Term Memory (LSTM) network
processes entire sequences of data, such as images, speech or video. It is applicable to tasks such
as unsegmented, connected handwriting recognition, speech recognition and anomaly detection
in network traffic or intrusion detection systems (IDSs). An LSTM unit consists of four parts: a
cell, an input gate, an output gate and a forget gate. The input, output and forget gates regulate
the information flowing into and out of the LSTM unit. Captions in natural language are
generated by combining, word by word, the descriptions of the different focus areas of the input
image, so LSTM units are essential to ensure that the generated words are related to each other
and together form a meaningful sentence in the target natural language.
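As a rough illustration of the gating described above, the following minimal NumPy sketch computes a single LSTM time step; the weight matrices, sizes and random values are arbitrary placeholders for illustration, not parameters of the proposed model.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U and b hold the stacked parameters of the
    forget (f), input (i), output (o) and candidate-cell (g) transformations."""
    z = W @ x_t + U @ h_prev + b           # pre-activations, shape (4 * hidden,)
    hidden = h_prev.shape[0]
    f = sigmoid(z[0:hidden])               # forget gate: what to discard from the cell
    i = sigmoid(z[hidden:2 * hidden])      # input gate: what new information to admit
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * hidden:])            # candidate cell contents
    c_t = f * c_prev + i * g               # updated cell state (the unit's memory)
    h_t = o * np.tanh(c_t)                 # new hidden state, used to predict the next word
    return h_t, c_t

# Arbitrary sizes purely for illustration.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden_size, input_size))
U = rng.normal(size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c, W, U, b)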
Processing an input image and generating a suitable caption describing it has
applications in many areas and can play a crucial role in several technical advancements, for
example:
• Self-driving cars: automatic driving is one of the biggest challenges, and if the scene around
the car can be captioned properly, it can give a boost to the self-driving system.
• Aid to the blind: a product can be created that guides blind people while travelling on the
roads without the support of anyone else, by first converting the scene into text and then the
text into voice. Both are now well-known applications of deep learning.
• CCTV cameras are everywhere today; if, along with viewing the world, relevant captions can
also be generated, then alarms can be raised as soon as some malicious activity is going on
somewhere. This could help reduce crime and accidents.
• Automatic captioning can help make Google Image Search as good as Google Search, as
every image could first be converted into a caption and then the search performed on the
caption.
PROBLEM DEFINITION AND OBJECTIVES
Digital image processing has become important and economical in many fields, such as
signature recognition, iris recognition and face recognition, forensics, automobile detection
systems and military applications across the world. Each of these applications has its own basic
requirements, which may differ from the others. Stakeholders of such systems demand that they
be faster and more accurate than other counterparts, as well as cheaper and equipped with more
extensive computational power. These traits are desirable because most of these systems are
used for mission-critical purposes, where the scope for error must be minimal. Such systems are
required to handle the complexity of modern problems such as intelligent crime, smart city
needs like smart traffic control systems, and disaster control and management systems. A
computer vision based model that is unbiased and free of any prejudice is therefore required to
generate captions describing the images given to it as input, so that the descriptions can be used
to automate existing systems such as traffic control, flood control or surveillance systems. This
will reduce the chance of error in such critical work and allow surveillance to be conducted 24x7
without human interaction.
The problem introduces a captioning task, which requires a computer vision system to
both localize and describe salient regions of images in natural language. The image captioning
task generalizes object detection, where the descriptions consist of a single word. Given a set of
images and prior knowledge about their content, the task is to find the correct semantic label for
the entire image(s).
The major objectives of the project are:
1. To use Long Short-Term Memory (LSTM) networks to generate sentences in natural
language that combine words from the vocabulary based on the different focus areas of
the input image, ensuring that the words in each sentence are related and together make
sense.
2. To demonstrate the successful use of Recurrent Neural Networks (RNNs) to generate
captions for input images in natural language (English).
3. To achieve high accuracy in correctly describing an input image.
LITERATURE REVIEW

Neural Image Caption Generation with Visual Attention [1]: in this publication the authors
present a model for image caption generation that combines recent progress in object detection
and machine translation. It identifies different aspects and components of an image using an
attention model over the input image, in which each word is generated by shifting the attention
to the relevant parts of the image. It describes how to train the model deterministically using
standard backpropagation techniques and stochastically by maximizing a variational lower
bound. It also shows through visualization how the model learns to fix its gaze on salient objects
while generating the corresponding words in the output sequence. The publication further
describes two variants of attention: a 'hard' (stochastic) version that attends to a single sampled
region of the image for each word, and a 'soft' (deterministic) version that spreads attention over
the whole image as a weighted average when generating each word. It uses three benchmark
datasets, and performance is measured using the BLEU and METEOR metrics.
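As a hedged illustration of the 'soft' (deterministic) variant described in [1], the sketch below computes a context vector as a softmax-weighted average of region features; the shapes and the bilinear scoring function are simplified placeholders rather than the exact formulation of the paper.

import numpy as np

def soft_attention(region_features, h_prev, W_a):
    # region_features: (L, D) annotation vectors, e.g. L = 14 * 14 spatial
    # locations with D-dimensional CNN features; h_prev: (H,) previous decoder
    # hidden state; W_a: (D, H) placeholder scoring weights.
    scores = region_features @ (W_a @ h_prev)   # (L,) relevance of each region
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                 # attention weights, summing to 1
    context = alpha @ region_features           # (D,) weighted average ("soft" attention)
    return context, alpha

# Toy shapes for illustration only.
L, D, H = 196, 512, 256
rng = np.random.default_rng(1)
context, alpha = soft_attention(rng.normal(size=(L, D)),
                                rng.normal(size=H),
                                rng.normal(size=(D, H)) * 0.01)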

Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator [2]: this paper
focuses on models for image caption generation that also rely on neural methods but, rather than
performing partial or wholesale caption retrieval, generate novel captions using a recurrent
neural network (RNN), usually a long short-term memory (LSTM). Typically, such models use
image features extracted from a pre-trained convolutional neural network (CNN), such as the
VGG CNN, to bias the RNN towards sampling terms from the vocabulary in such a way that a
sequence of such terms produces captions that are relevant to the image. The paper presents two
views of the role of the RNN in an image caption generator. In the first, the RNN decides which
word is most likely to be generated next, given what has been generated before. In multimodal
generation, this view encourages architectures where the image is incorporated into the RNN
along with the words generated so far, allowing the RNN to make visually informed predictions.
The second view is that the RNN's role is purely memory based: it is only there to encode the
sequence of words generated so far. This representation informs caption prediction at a later
layer of the network as a function of both the RNN encoding and the perceptual features. This
view encourages architectures where vision and language are brought together late, in a
multimodal layer.

Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation
Measures [3]: in this survey, the authors classify existing approaches to caption generation by
how they conceptualize the problem, viz., models that cast description either as a generation
problem or as a retrieval problem over a visual or multimodal representational space. It provides
a detailed review of existing models, highlighting their advantages and disadvantages. It also
gives an overview of the benchmark image datasets and the evaluation measures developed to
assess the quality of machine-generated image descriptions, and it extrapolates future directions
in the area of automatic image description generation. The authors conclude that, in comparison
to traditional keyword-based image annotation (using object recognition, attribute detection,
scene labeling, etc.), automatic image description systems produce more human-like
explanations of visual content, providing a more complete picture of the scene.

Show and Tell: A Neural Image Caption Generator [4]: in this paper the authors develop a model
called NIC, an end-to-end neural network system that can automatically view an image and
generate a reasonable description in plain English. NIC is based on a convolutional neural
network that encodes an image into a compact representation, followed by a recurrent neural
network that generates a corresponding sentence. The model is trained to maximize the
likelihood of the sentence given the image. Experiments on several datasets show the robustness
of NIC in terms of qualitative results (the generated sentences are very reasonable) and
quantitative evaluations, using either ranking metrics or BLEU, a metric used in machine
translation to evaluate the quality of generated sentences. The paper shows that as the size of the
available datasets for image description increases, so will the performance of approaches like
NIC.
Collective Generation of Natural Image Descriptions [5]: the authors present a holistic
data-driven approach to image description generation, exploiting the vast amount of (noisy)
parallel image data and associated natural language descriptions available on the web. More
specifically, given a query image, the model retrieves existing human-composed phrases used to
describe visually similar images, then selectively combines those phrases to generate a novel
description for the query image. It casts the generation process as constraint optimization
problems, collectively incorporating multiple interconnected aspects of language composition
for content planning, surface realization and discourse structure. Evaluation by human
annotators indicates that the final system generates more semantically correct and linguistically
appealing descriptions than two nontrivial baselines.

Composing Simple Image Descriptions using Web-scale N-grams [6]: the authors present a
simple yet effective approach to automatically composing image descriptions from computer
vision inputs and web-scale n-grams. Unlike most past work that summarizes or retrieves
existing text relevant to an image, this technique composes sentences entirely from scratch.
Experimental results show that it is viable to produce simple descriptions that are relevant to the
specific content of an image while allowing creativity in the description, making for more
human-like annotations than past methodologies. The methodology comprises two steps:
(n-gram) phrase selection and (n-gram) phrase fusion. The first step, phrase selection, gathers
candidate phrases that may be potentially useful for producing the description of a given image.
The second step, phrase fusion, finds the optimal compatible set of phrases using dynamic
programming to compose a new (and more complex) sentence that describes the image.

Multimodal Neural Language Models [7]: the authors present two multimodal neural language
models: models of natural language that can be conditioned on other modalities. An image-text
multimodal neural language model can be used to retrieve images given complex sentence
queries, retrieve phrase descriptions given image queries, and generate text conditioned on
images. In contrast to many existing techniques, this approach can generate sentence
descriptions for images without the use of templates, structured prediction or syntactic trees.
Instead, it relies on word representations learned from millions of words and on conditioning the
model on high-level image features learned from deep neural networks. The authors present two
strategies based on the log-bilinear model of Mnih and Hinton (2007): the modality-biased
log-bilinear model and the factored 3-way log-bilinear model. Word representations and image
features are learned jointly by training the language models together with a convolutional
network.

Corpus-Guided Sentence Generation of Natural Images [8]: the authors propose a sentence
generation strategy that describes images by predicting the most likely nouns, verbs, scenes and
prepositions that make up the core sentence structure. The inputs are initial noisy estimates of
the objects and scenes detected in the image using state-of-the-art trained detectors. Since
predicting actions directly from still images is unreliable, a language model trained on the
English Gigaword corpus is used to obtain these estimates, together with the probabilities of
co-located nouns, scenes and prepositions. These estimates are used as parameters of an HMM
that models the sentence generation process, with sentence components as hidden nodes and
image detections as the emissions. The description of an image is the output of a complex
process that involves: 1) perception in the visual space, 2) grounding to world knowledge in the
language space and 3) speech/text production. Experimental results show that this strategy of
combining vision and language produces readable and descriptive sentences compared to naive
strategies that use vision alone.

Recurrent Neural Network Regularization [9]: the authors present a simple regularization
technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM)
units. Unfortunately, dropout (Srivastava, 2013), the most powerful regularization method for
feedforward neural networks, does not work well with RNNs. As a result, practical applications
of RNNs often use models that are too small, because large RNNs tend to overfit, and existing
regularization methods give relatively small improvements for RNNs (Graves, 2013). In this
work the authors show that dropout, when correctly used, greatly reduces overfitting in LSTMs,
and they evaluate it on three different problems.
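In Keras, which this project uses, the idea can be sketched roughly as below; the layer sizes are illustrative, and dropout is restricted to the non-recurrent (input) connections in the spirit of [9], which applies dropout only to non-recurrent connections.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 5000, 34   # illustrative values only

model = Sequential()
model.add(Embedding(vocab_size, 256, input_length=max_len))
# dropout= applies to the input (non-recurrent) connections,
# recurrent_dropout= to the recurrent ones; here the latter is left at 0.
model.add(LSTM(256, dropout=0.5, recurrent_dropout=0.0))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')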

Matching Words and Pictures [10]: in this paper the authors present a new approach for
modeling multi-modal data sets, focusing on the specific case of segmented images with
associated text. Learning the joint distribution of image regions and words has many
applications. They consider in detail predicting the words associated with whole images
(auto-annotation) and with particular image regions (region naming). Auto-annotation can help
organize and access large collections of images. Region naming is a model of object recognition
as a process of translating image regions to words, much as one might translate from one
language to another. Learning the relationships between image regions and their semantic
correlates (words) is an interesting example of multi-modal data mining, particularly because it
is typically hard to apply data mining techniques to collections of images. They develop a
number of models for the joint distribution of image regions and words, including several that
explicitly learn the correspondence between regions and words. They study multi-modal and
correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation
model adapted from statistical machine translation (Brown et al.), and a multi-modal extension
to a mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large
collection of annotated images of real scenes.

I2T: Image Parsing to Text Description [11]: the authors present an Image to Text (I2T)
framework which converts image and video content into textual descriptions based on image (or
frame) understanding. The proposed framework follows three steps. In the first step, the input
images (or video frames) are decomposed into their constituent visual patterns by an image
parsing engine. In the second step, the results of the first step are converted into a semantic
representation in the form of Web Ontology Language (OWL). Finally, in the third step, the
OWL representation is converted into semantically meaningful, human-readable and queryable
text reports by a text generation engine. The And-or-Graph (AoG) visual knowledge
representation is the highlight of the I2T framework. It provides a graphical representation for
learning categorical image and symbolic representations from large-scale image data. It takes a
top-down approach during image parsing and connects low-level image features with high-level
semantic concepts, so that the image can be translated into semantic metadata and eventually
into a textual description. The I2T framework is distinctive because it generates semantically
meaningful annotations. Since the image and video contents are converted into both OWL and
text, the framework can be merged with a full-text search engine to provide error-free
content-based retrieval. Users can also query images and video clips based on keywords and
semantics.

Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning
Approach [12]: this paper proposes an image caption generator using deep learning. It aims to
generate captions for a given image based on mechanisms that involve both image processing
and computer vision. The mechanism detects the relationships between the different people,
objects and animals in the image while capturing their semantic meaning and converting it into
natural language. A Regional Object Detector (RODe) is used for detection, recognition and
caption generation. The proposed method centers on deep learning to further enhance the
existing system. The method is applied to the Flickr 8k dataset and generated captions with more
expressive significance and descriptive meaning than existing image caption generators.

An image conveys a message: A brief survey on image description generation [13]: since there
has been great interest in the research community in new ways to automate retrieving images
based on content, this paper presents a brief survey of technical aspects and methods for image
description generation. The paper gives an overview of current work and of the significant
aspects of existing systems. There are various image description systems, but results show that
there is still a need for a better framework with improved performance. Four key areas are
identified in which future work can improve performance. Firstly, multiple-sentence generation
techniques should explore different context settings; for better outcomes, a combination of
various caption association models may improve performance. Secondly, description generation
systems need new assessment metrics that are not biased by the lack of detailed annotation.
Thirdly, a framework is needed that produces multiple similar descriptions of a picture with
different contents. Fourthly, due to limited judgement, the prediction of small objects is still an
issue; additionally, higher-level description generation annotation expressions may increase
performance.

Video Storytelling: Textual Summaries for Events [14]: in this work the authors introduce the
problem of video storytelling, which aims to produce coherent and concise stories for long
videos. The diversity of the stories and the length and complexity of the videos present new
difficulties, and the authors propose novel strategies to address them. First, they propose a
context-aware framework for multimodal embedding learning, designing a Residual
Bidirectional RNN to use contextual information from the past and the future. The multimodal
embedding is then used to retrieve sentences for video clips. Secondly, they propose a Narrator
model to choose clips that are representative of the underlying storyline. The Narrator is
formulated as a reinforcement learning agent, trained by directly optimizing the textual metric of
the generated story. They assess the approach on the Video Story dataset, a dataset they gathered
to enable the study, compare the method with various state-of-the-art baselines, and show that
their technique achieves better performance in terms of both quantitative measures and a user
study.

Know More Say Less: Image Captioning Based on Scene Graphs [15]: in this paper the authors
propose a framework for image captioning based on scene graphs. Recently, several approaches
have detected semantic concepts in images and then encoded them into high-level
representations. Although considerable progress has been made, most previous methods treat
the entities in an image independently, thus missing the structured information that provides
important cues for image captioning. Scene graphs contain a large amount of structured data,
since they not only describe the objects in an image but also represent their pairwise
relationships. To use both the visual features and the semantic information in structured scene
graphs, CNN features are extracted from the bounding boxes of object entities for the visual
representation, and semantic relationship features are extracted from triplets (e.g., man eating
apple) for the semantic representation. After obtaining these features, the authors present a
hierarchical-attention-based module to learn discriminative features for word generation at each
step.
REQUIREMENTS ANALYSIS

Functional Requirements:

FR 1: The dataset used should be from a reliable source and in a proper format.

FR 2: The input image should be used to generate a 14 x 14 feature map using convolutional
feature extraction. This reduces the dimensionality of the input images so that they can be used
in an RNN with attention over the image.

FR 3: Use two different attention mechanisms over the input image, namely stochastic attention
and deterministic attention, and combine the results to generate words for the caption.

FR 4: The available dataset should be split into training data and test data in a proper, balanced
way to produce a good classification model.

FR 5: Use a Long Short-Term Memory (LSTM) network that produces the caption by
generating one word at every time step, conditioned on a context vector, the previous hidden
state and the previously generated words.

FR 6: Gain insight into and interpret the results of the framework by visualizing "where" and
"what" the attention focused on.

FR 7: Quantitatively validate the usefulness of attention in caption generation with
state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

FR 8: The trained model should be tested on a holdout set and should achieve maximum
accuracy.

FR 9: The model should be able to generate captions for new photographs that were not present
in the data used to train it.

Non-Functional Requirements:

NFR 1: The response time of the developed model should be low and convenient for users.

NFR 2: The model should be scalable, that is, it should work on large datasets with the same
deliverables.

NFR 3: The model should have high usability and interoperability.


Software Requirements:

Python-3.7.4:

Python is an interpreted, high-level, general-purpose programming language. It is chosen for the
project work due to its high data-handling capacity and simplicity.

Python pickle:

The Python pickle module is used for serializing and de-serializing a Python object structure.
Any object in Python can be pickled so that it can be saved to disk. Pickle "serializes" the object
before writing it to a file; pickling is a way to convert a Python object (list, dict, etc.) into a
character stream.
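For instance, a dictionary of extracted photo features can be saved and reloaded as follows (the object and file name are illustrative):

import pickle

features = {'example_photo_id': [0.12, 0.53, 0.07]}  # illustrative object

# Serialize ("pickle") the object to disk.
with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)

# De-serialize ("unpickle") it back into a Python object.
with open('features.pkl', 'rb') as f:
    restored = pickle.load(f)

assert restored == features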

Nltk.translate.bleu_score package:

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for evaluating a
generated sentence against a reference sentence. The approach works by counting matching
n-grams between the candidate translation and the reference text, where a 1-gram (unigram)
comparison is made token by token and a bigram comparison is made over word pairs; the
comparison is made regardless of word order. The nltk.translate.bleu_score package provides
functions for computing this score between candidate and reference sentences.
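A small usage sketch of the package (the tokens are made up for illustration):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['a', 'dog', 'is', 'running', 'on', 'the', 'grass']]  # list of reference token lists
candidate = ['a', 'dog', 'runs', 'on', 'the', 'grass']             # generated caption tokens

# Cumulative BLEU-1 and BLEU-4 scores; smoothing avoids zero scores on short captions.
smooth = SmoothingFunction().method1
print('BLEU-1:', sentence_bleu(reference, candidate, weights=(1.0, 0, 0, 0), smoothing_function=smooth))
print('BLEU-4:', sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth))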

NumPy-v1.17:

NumPy is the fundamental package for scientific computing in Python. It is a Python library that
provides a multidimensional array object, various derived objects (such as masked arrays and
matrices), and an assortment of routines for fast operations on arrays, including mathematical,
logical and shape-manipulation operations.

Matplotlib-3.1.1:

Matplotlib is a plotting library for the Python programming language and NumPy. It provides an
object-oriented API for embedding plots into applications.

Keras:

Keras is an open-source neural network library written in Python. It contains commonly used
neural network building blocks such as layers and objectives. It is compatible with Python
versions 2.7-3.7.

Hardware Requirements:

Operating System: Windows 10

Processor: 1.5GHz processor

RAM: 8 GB
SYSTEM ARCHITECTURE AND DESIGN

A standard encoder-decoder recurrent neural network architecture is used to address the image
caption generation problem. This involves two elements:

1. Encoder: A network model that reads the input photograph and encodes the content into
a fixed-length vector using an internal representation.
2. Decoder: A network model that reads the encoded photograph and generates the textual
description output.

The merge model, described in Figure 1, is a type of encoder-decoder architecture. It combines
the encoded form of the image input with the encoded form of the text description generated so
far. The combination of these two encoded inputs is then used by a very simple decoder model
to generate the next word in the sequence. The approach uses the recurrent neural network only
to encode the text generated so far, which separates the concerns of modeling the image input,
modeling the text input, and combining and interpreting the encoded inputs.

Figure 1: Merge Model

The proposed system is based on the “merge-model”. The schematic of the model is reproduced
in Figure 2.
Figure 2: Merge Model Schematic

The development of the deep learning model can be described in three parts:

● Photo Feature Extractor: This is a 16-layer VGG model pre-trained on the ImageNet
dataset. The photos are pre-processed with the VGG model (without the output layer),
and the extracted features predicted by this model are used as input (a hedged
feature-extraction sketch follows this list).

● Sequence Processor: This is a word embedding layer for handling the text input,
followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.

● Decoder: Both the feature extractor and sequence processor output a fixed-length vector.
These are merged together and processed by a Dense layer to make the final prediction.
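A minimal sketch of the photo feature extraction step, assuming Keras and a placeholder image path: the output layer of the VGG model is dropped so that the second-to-last fully connected layer yields a 4,096-element feature vector per photo.

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# Load the 16-layer VGG model pre-trained on ImageNet and remove its output layer,
# so that it emits 4,096-element fc2 features instead of class probabilities.
vgg = VGG16()
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_path):                # image_path is a placeholder
    image = load_img(image_path, target_size=(224, 224))
    image = img_to_array(image).reshape((1, 224, 224, 3))
    image = preprocess_input(image)              # VGG-specific pixel preprocessing
    return feature_extractor.predict(image)      # shape (1, 4096)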

Figure 3: Development of Deep Learning Model


The Photo Feature Extractor model expects the input photo features to be a vector of 4,096
elements. These are processed by a Dense layer to produce a 256-element representation of the
photo. The Sequence Processor model expects input sequences of a predefined length (34
words), which are fed into an Embedding layer that uses a mask to ignore padded values,
followed by an LSTM layer with 256 memory units. Both input models produce a 256-element
vector and use regularization in the form of 50% dropout, which reduces overfitting of the
training dataset, as this model configuration learns very fast. The Decoder model merges the
vectors from both input models using an addition operation. The result is fed to a Dense layer of
256 neurons and then to a final output Dense layer that makes a softmax prediction over the
entire output vocabulary for the next word in the sequence.
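A hedged Keras sketch of the configuration just described; the vocabulary size is an illustrative placeholder, and the code is a sketch of the described layers rather than the final implementation.

from keras.models import Model
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size, max_length = 7579, 34   # vocabulary size is illustrative; 34-word sequences as described

# Photo Feature Extractor branch: 4,096-element VGG features -> 256-element representation.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Sequence Processor branch: embedding (masking padded values) followed by a 256-unit LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge the two 256-element vectors by addition, then predict the next word.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')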

Figure 4 shows a plot of the structure of the network that helps in understanding the two streams
of input.
Figure 4: Network Structure

Figure 4 depicts the flow of the two streams of input through the model as they pass through the
LSTM units of the Recurrent Neural Network (RNN) to finally decide each word from the
vocabulary describing the current focus area of the input image, while taking into account all the
previously generated words of the caption.
The overall data flow diagram of the proposed system is shown in Figure 5.

Figure 5: Data Flow Diagram

The data flow diagram depicts the flow of data across the proposed model. It starts with
preparing the photo data collected from different sources and preprocessing it into the expected
input format. After this, the text descriptions (captions) associated with each image are prepared,
and the text and image data are combined to develop the deep learning model using a
combination of the photo feature extractor, sequence processor and decoder. The model is then
trained using progressive loading to ensure accuracy. The developed model is evaluated on the
test data split from the total dataset. Once satisfactory accuracy is reached, the model can be
used to generate captions for new input images.
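Once trained, captions for a new photo can be generated word by word. The sketch below uses greedy selection; the tokenizer, the 'startseq'/'endseq' boundary tokens and the extract_features helper are assumptions carried over from the earlier sketches rather than fixed parts of the design.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    # Greedily generate a caption, one word per time step.
    caption = 'startseq'                                  # hypothetical start-of-caption token
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_features, seq], verbose=0)
        word_id = int(np.argmax(probs))                   # most probable next word
        word = tokenizer.index_word.get(word_id)
        if word is None or word == 'endseq':              # hypothetical end-of-caption token
            break
        caption = caption + ' ' + word
    return caption.replace('startseq', '').strip()

# Example use (assumes the model, tokenizer and extractor from the sketches above):
# features = extract_features('new_photo.jpg')            # placeholder file name
# print(generate_caption(model, tokenizer, features, max_length))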
REFERENCES

1. Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual
attention." International conference on machine learning. 2015.

2. Tanti, Marc, Albert Gatt, and Kenneth P. Camilleri. "What is the Role of Recurrent
Neural Networks (RNNs) in an Image Caption Generator?." arXiv preprint
arXiv:1708.02043 (2017).

3. Bernardi, Raffaella, et al. "Automatic description generation from images: A survey of
models, datasets, and evaluation measures." Journal of Artificial Intelligence Research 55
(2016): 409-442.

4. Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of
the IEEE conference on computer vision and pattern recognition. 2015.

5. Kuznetsova, Polina, et al. "Collective generation of natural image descriptions."
Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012.

6. Li, Siming, et al. "Composing simple image descriptions using web-scale n-grams."
Proceedings of the Fifteenth Conference on Computational Natural Language Learning.
Association for Computational Linguistics, 2011.

7. Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language
models." International conference on machine learning. 2014.
8. Yang, Yezhou, et al. "Corpus-guided sentence generation of natural images." Proceedings
of the Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, 2011.

9. Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. "Recurrent neural network
regularization." arXiv preprint arXiv:1409.2329 (2014).

10. Barnard, Kobus, et al. "Matching words and pictures." Journal of machine learning
research 3.Feb (2003): 1107-1135.

11. Yao, Benjamin Z., et al. "I2t: Image parsing to text description." Proceedings of the IEEE
98.8 (2010): 1485-1508.

12. Kumar, N. Komal et al. “Detection and Recognition of Objects in Image Caption
Generator System: A Deep Learning Approach.” 2019 5th International Conference on
Advanced Computing & Communication Systems (ICACCS) (2019): 107-109.

13. Shabir, Sidra, and Syed Yasser Arafat. "An image conveys a message: A brief survey on
image description generation." 2018 1st International Conference on Power, Energy and
Smart Grid (ICPESG). IEEE, 2018.

14. Li, Junnan, et al. "Video Storytelling: Textual Summaries for Events." IEEE Transactions
on Multimedia (2019).

15. Li, Xiangyang, and Shuqiang Jiang. "Know more say less: Image captioning based on
scene graphs." IEEE Transactions on Multimedia 21.8 (2019): 2117-2130.
