Text Detection, Tracking and Recognition in Video: A Comprehensive Survey
Xu-Cheng Yin, Senior Member, IEEE, Ze-Yu Zuo, Shu Tian, and Cheng-Lin Liu, Fellow, IEEE

Abstract: The intelligent analysis of video data is currently
in wide demand because video is a major source of sensory
data in our lives. Text is a prominent and direct source of
information in video, whereas recent surveys of text detection
and recognition in imagery focus mainly on text extraction from
scene images. This paper presents a comprehensive survey of
text detection, tracking, and recognition in video with three major
contributions. First, a generic framework is proposed for video
text extraction that uniformly describes detection, tracking,
recognition, and their relations and interactions. Second, within
this framework, a variety of methods, systems, and evaluation
protocols of video text extraction are summarized, compared,
and analyzed. Existing text tracking techniques, tracking-based
detection and recognition techniques are specifically highlighted.
Third, related applications, prominent challenges, and future
directions for video text extraction (especially from scene videos
and web videos) are also thoroughly discussed.
Index Terms: Text tracking, tracking based text detection, tracking based text recognition, video text extraction, scene text.

I. INTRODUCTION
THE explosive growth of smart phones and online social media has led to the accumulation of large amounts of visual data, in particular, the massive and increasing collections of video on the Internet and social networks. For
example, YouTube1 streamed approximately 100 hours of
video per minute worldwide in 2014. These countless videos
have triggered research activities in multimedia understanding
and video retrieval [3], [4].
In the literature, text has received increasing attention as
a key and direct information source in video. As examples,

Manuscript received August 4, 2015; revised December 15, 2015,


February 18, 2016, and April 7, 2016; accepted April 7, 2016. Date of
publication April 14, 2016; date of current version April 29, 2016. This work
was supported by the National Natural Science Foundation of China under
Grant 61411136002 and Grant 61473036. The associate editor coordinating the review of this manuscript and approving it for publication
was Dr. Dimitrios Tzovaras.
X.-C. Yin is with the Department of Computer Science and Technology,
University of Science and Technology Beijing, Beijing 100083, China, and
also with the Beijing Key Laboratory of Materials Science Knowledge
Engineering, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China (e-mail:
xuchengyin@ustb.edu.cn).
Z.-Y. Zuo and S. Tian are with the Department of Computer Science
and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China (e-mail:
zzy0905@qq.com; tianshu0816@126.com).
C.-L. Liu is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail:
liucl@nlpr.ia.ac.cn).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2016.2554321
1 http://www.youtube.com/.

caption text usually annotates information concerning where and when the events in a video happened or who was involved [5], and signage text is widely used as a visual indicator for navigation and notification in scenes [6]. Hence,
text extraction and analysis in video has attracted considerable
attention in multimedia understanding systems. For example, in recent years, all the top winners of the TRECVID
Multimedia Event Detection2 have combined video text, audio,
and visual features to construct their video retrieval systems.
Specifically, some researchers performed investigations of
video retrieval by leveraging both textual video representations (extracted from text in frames and audio) and visual
representations using high-level object and action concepts and
found that the ability to understand video text can significantly
improve the retrieval performance [7].
A wide variety of methods have been proposed to extract
text from images and videos, and several studies have contributed reviews [1], [2], [8]–[12]. Jung et al. comprehensively surveyed a large number of techniques for extracting
text information from images and videos [9]. Meanwhile,
Liang et al. extensively summarized various methods for
camera-based analysis and recognition of text and documents [10]. However, all the reviewed works in these
two surveys were published before 2004. More recently,
Zhang et al. classified and assessed new algorithms for text
extraction from scene images [12]. Most recently, Ye and
Doermann presented a broad summarization of the methods,
systems and challenges of text detection and recognition in
imagery [1]. Zhu et al. provided a comprehensive survey
of scene text detection and recognition with recent advances
and future trends [2]. However, all the recent surveys focus
mainly on text extraction from scene images. Despite this,
numerous advanced techniques for video text extraction have
proliferated impressively over the past decade. Although video
text extraction techniques have been addressed in previous
surveys [1], [9], [10], they were treated in either an image-based framework or separated into tracking or enhancement
sections.
This paper presents a comprehensive survey of text detection, tracking and recognition methods and systems in video,
with a special focus on recent technical advancements.
In contrast to the previous surveys [1], [9], this survey uniformly summarizes and describes detection, tracking, recognition and their relations and interactions within a generic video
text extraction framework. In particular, to stress the analysis
2 http://nist.gov/itl/iad/mig/med.cfm.

Fig. 1. Text in video: (a) Layered caption text, (b) embedded caption text, and (c) scene text, where embedded caption text and scene text are more challenging
to detect, track and recognize.

of spatial-temporal information, we extensively review techniques of text tracking, tracking based detection, and tracking
based recognition. Finally, available datasets, representative
challenges, technological applications and future directions of
video text extraction (especially from scene videos and web
videos) are also described in depth in this paper.
A. Text in Video
Following the method of categorization in [9], text in video
is categorized as either caption or scene text (see examples
in Fig.1). Caption text is also called graphic text [1] or
artificial text [13]. Caption text provides good directivity and
a high-level overview of the semantic information in captions,
subtitles and annotations of the video, while scene text is
part of the camera images and is naturally embedded within
objects (e.g., trademarks, signboards and buildings) in scenes.
Moreover, we classify caption text into two subcategories: layered caption text and embedded caption text. Layered caption
text is always printed on a specifically designed background
layer (see Fig.1(a)), while embedded caption text is overlaid
and embedded on the frame (see Fig.1(b)). Generally speaking,
scene text and embedded caption text are more challenging to detect, track, and recognize, and they are also the focus of this survey.
II. A UNIFIED FRAMEWORK FOR VIDEO TEXT DETECTION, TRACKING AND RECOGNITION
Some researchers have presented specific frameworks for
video text extraction. For example, Antani et al. divided video
text extraction into four tasks: detection, localization, segmentation, and recognition [14]. In their system, the tracking
stage provides additional input to the spatial-temporal decision
fusion for improving localization. Jung et al. summarized the
subproblems of a text information extraction system for both
images and video into text detection, localization, tracking,
extraction and enhancement, and recognition [9]. The video
text recognition flowchart designed by Elagouni et al. [15] is
similar to that of [9], but added a correction (postprocessing)
step with natural language processing. In contrast, in this
paper we propose a unified video text extraction framework,
where text detection, tracking, and recognition techniques are
uniformly described and surveyed.

Fig. 2. DETR: A unified framework for text DEtection, Tracking and Recognition in video. This framework uniformly describes detection, tracking, recognition (the three main tasks), and their relations and interactions. The major relations among these main tasks are unified as detection-based-recognition, tracking-based-detection and tracking-based-recognition. The other relations among these tasks are also named refinement-by-recognition (for detection), refinement-by-recognition (for tracking) and tracking-with-detection.

The unified video text DEtection, Tracking and


Recognition (DETR) framework is shown in Fig.2. Here,
Detection is the task of localizing the text in each video frame
with bounding boxes. Tracking is the task of maintaining the
integrity of the text location and tracking text across adjacent
frames. Recognition involves segmenting (if necessary)
text and recognizing it using Optical Character Recognition (OCR) techniques. Obviously, Recognition is performed
on text regions detected from Detection results (Detection-based-Recognition), and Tracking uses the locations identified
in the Detection step to track text (Tracking-with-Detection).3
In general, Detection is first performed in each frame
independently; then, the detected results in sequential frames
can be integrated and enhanced based on the Tracking results
(Tracking-based-Detection). Similarly, Recognition can help
3 In our paper, tracking-with-detection means using locations identified by
detection to track text, i.e., (1) using detection results in the first one or several frames to locate the initial position for tracking, (2) using detection results in
certain middle frames to verify the location of the tracking object, or (3) using
detection results in consecutive frames for linking a trajectory and tracking the
text. Actually, the last case is the same as tracking-by-detection, a typical
technique in the field of object tracking, which is described in Section IV-A.

verify the Tracking results (Refinement-by-Recognition


for Tracking), and also confirm the Detection results in
some cases [16] (Refinement-by-Recognition for Detection).
Meanwhile, Tracking can improve Recognition by fusing the
recognition results over multiple frames (Tracking-basedRecognition).
Within the DETR framework, this survey mainly reviews
techniques with multiple frames in video, i.e., text tracking, tracking based detection, and tracking based recognition.
It first briefly describes text detection (Section III-A) and recognition (Section III-B) techniques in one video frame (individual images). Next, text tracking techniques are extensively summarized in Section IV-A (Tracking-with-Detection in Section IV-A3). Tracking-based-Detection and Tracking-based-Recognition techniques are then surveyed and highlighted in Sections IV-B and IV-C, respectively. Additionally,
evaluations and available datasets of video text extraction
are presented in Section V. Section VI describes related
applications, and Section VII discusses challenges and future
directions. The final summary is presented in Section VIII.
III. VIDEO TEXT DETECTION AND RECOGNITION USING INDIVIDUAL FRAMES
As mentioned in Section I-A, this survey focuses on extracting embedded caption text and scene text from video. Many
video text extraction methods detect and recognize text in each
sampled individual frame (i.e., frame by frame) without multi-frame integration [9], [11]. In this section, we briefly summarize the detection and recognition techniques for (embedded)
captions and scene text in individual frames in recent years
(since 2004).
A. Text Detection
Existing methods for text detection can be roughly categorized into two major groups: connected component (CC) based
methods [17]–[20], and region based methods (also called
sliding window based methods) [21]–[33].
CC based methods extract character candidates from images
by connected component analysis followed by grouping character candidates into text, probably with additional checks to
remove false positives. CC based methods usually perform
well for captions that have uniform color and regular spacing;
however, CCs may not preserve the full shapes of characters
due to color bleeding and the low contrast of text lines.
Region based methods use a binary text/non-text classifier
to search for possible text regions over windows of multiple
scales and aspect ratios and then group the text regions into
text. The classifier may utilize various features including color,
edges, gradients, texture and other related region features
to distinguish between text and background. Candidate text
regions are first found with an edge map or gradient information in video frames. Subsequently, a refinement stage is
conducted using heuristic rules or learned classifiers. These
methods are fast and overcome low contrast problems, but they
produce many false positives when the background is complex.
To overcome this problem, texture features are utilized to
detect text in video frames [5], [34]–[44]. The base techniques

of texture features use various methods (e.g., Gabor filter,


fast Fourier transform, spatial variance, wavelet transform, or
multi-channel processing) to calculate the texture of blocks.
Then, proper classifiers are employed to classify text blocks
and non-text blocks. These techniques fail when text-like textures appear in the background. Some other methods integrate
hybrid features to distinguish text from the background such
as texture and edge features [45], color and edge/gradient
features [46], [47], and wavelet and color features [40].
In recent years, numerous research efforts have focused
on detecting scene text in video. For example, some techniques [32], [33], [44] have been used to detect scene text in
individual frames. Consequently, we will also briefly review
the major recent methods for text detection from scene
images.4
Scene text detection methods can also be categorized into the same two major groups: connected component (CC) based methods [6], [48]–[53], and region based
methods [36], [54]–[57]. Pan et al. also proposed a hybrid
method [58] that exploits a sliding window classifier to detect
text candidates and then extracts connected components as
character candidates, which are verified as text or non-text
using a Conditional Random Field (CRF) model [59].
Among the CC based methods, the ones based on Maximally Stable Extremal Regions (MSERs) [60] and the Stroke Width Transform (SWT) [50] are outstanding. MSER/ER based methods utilize the color (intensity) uniformity of text strokes [6], [61]–[69], while SWT utilizes the uniformity of the width of text strokes to detect text [50], [70]–[72]. Sliding
window based methods usually use an AdaBoost classifier
for fast detection [54], or more recently, use deep learning
techniques (Convolutional Neural Networks, CNNs) to improve
text/non-text discrimination [56], [73].
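To make the MSER-based character candidate extraction concrete, the following is a minimal sketch using OpenCV's MSER detector (OpenCV 3.x/4.x API); the aspect-ratio filter and the absence of grouping/verification are simplifying assumptions for illustration, not the settings of any surveyed method.

```python
import cv2

def mser_character_candidates(frame_bgr):
    """Extract character candidate bounding boxes from one frame with MSER.

    A minimal sketch of the CC-based detection idea: extremal regions with
    stable intensity are treated as character candidates and crudely filtered
    by aspect ratio before any grouping into text lines.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()                    # default parameters
    regions, bboxes = mser.detectRegions(gray)  # pixel lists and bounding boxes
    candidates = []
    for (x, y, w, h) in bboxes:
        aspect = w / float(h)
        if 0.1 < aspect < 10.0:                 # illustrative character-shape filter
            candidates.append((int(x), int(y), int(w), int(h)))
    return candidates
```

In a full detector, these candidates would then be grouped into text lines and verified by a text/non-text classifier.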
B. Text Recognition
Video text recognition is conventionally performed using
existing OCR techniques; in other words, text regions are first
segmented from video frames and then fed into a state-of-the-art OCR engine [9]. However, the recognition performance
relies heavily on text segmentation/binarization (removing the
background) and may suffer from noise and distortion in
complex videos. Hence, several methods have been specifically
designed for video text recognition.
One strategy is to design a totally new text recognition framework. For example, in a method proposed by
Elagouni et al. [74], text extracted from videos is first represented as sequences of learned features with a multi-scale
scanning scheme. The sequence of features is fed into a
connectionist recurrent model specifically designed to take
the dependencies between successive learned features into
account. Finally, the recurrent model recognizes video words
holistically, without any explicit character segmentation.
The other strategy is to combine well-formulated OCR
techniques with novel recognition methods. For example,
4 More details about text detection and recognition techniques in scene images can be found in [1].

Fig. 3. A tree-style presentation of the different categories of methods for text tracking, detection and recognition using multiple frames.

Elagouni et al. [15] reduced the ambiguities involved in character segmentation by considering character recognition results
and introducing some linguistic knowledge. Chen et al. [75]
described a multiple-hypotheses framework and proposed a
new gray-scale consistency constraint (GCC) algorithm to
improve character segmentation. Saidane and Garcia [76]
presented a video text-oriented binarization method using a
specific architecture of convolutional neural networks.
Similarly, recognizing text in scene videos has attracted more and more interest from researchers in the fields of document analysis and recognition, computer vision, and machine
learning [77]. Therefore, we also briefly summarize several
representative methods for text recognition in scene images.
As for scene text (cropped word) recognition, the existing methods can be grouped into segmentation-based word
recognition and holistic word recognition (word spotting [78]).
In general, segmentation-based word recognition methods
integrate character segmentation and character recognition
with language priors using optimization techniques, such as
Markov models [79] and CRFs [80], [81]. Given a lexicon of
words, the goal in word spotting is to identify specific words
in scene images without character segmentation [82]. Most
text recognition methods rely on text segmentation (removing
background). Fortunately, most CC-based detection methods
and some region-based methods already output text images
without the background. On the other hand, some recognition
methods use classifiers (such as CNNs) directly on text regions
mixed with the background. However, this approach requires
a large number of training samples of various text fonts and
backgrounds to train the classifier.
In recent years, mainstream segmentation-based word
recognition techniques typically over-segment the word image
into small segments, combine adjacent segments into candidate
characters, classify them using CNNs or gradient feature-based classifiers, and find an approximately optimal word
recognition result using beam search [83], Hidden Markov
Models [84], or dynamic programming [73]. Word spotting
methods usually calculate a similarity measure between the

candidate word image and a query word. Impressively, some


recent methods design an appropriate CNN architecture and
train the CNNs directly on holistic word images [85], [86], or
use label embedding techniques to enrich the relations between
word images and text strings [87], [88].
IV. VIDEO TEXT DETECTION, TRACKING AND RECOGNITION USING MULTIPLE FRAMES
In video content understanding, spatial-temporal analysis and multi-frame integration are general strategies.
Compared to images, the temporal information (the dependencies between adjacent video frames) is helpful in improving
text detection and recognition. Correspondingly, using spatial
and temporal information acquired from multiple frames,
video text tracking, tracking based detection, and tracking
based recognition methods are comprehensively surveyed and
highlighted in this section. The categorizations of these methods are summarized in Fig. 3, where text tracking methods are
divided into subgroups based on typical tracking strategies,
while tracking based detection and recognition techniques
are categorized by their multi-frame integration and fusion strategies.
Fig. 3 includes three major topics that use multiple frames
in the unified framework of video text extraction (Fig. 2),
i.e., text tracking, tracking based detection and tracking based
recognition, which are also highlighted in this survey.
A. Text Tracking
The goal of text tracking is to continuously determine
the location of text across multiple dynamic video frames.
Text tracking is useful for verification, integration, enhancement and speedup in video text detection and recognition
(e.g., tracking based detection and tracking based recognition
in the DETR framework in Fig.2). In recent years, a variety
of text tracking methods have been investigated in the literature. We roughly categorize these text tracking methods into
two groups: tracking with detection and refinement by recognition. In tracking with refinement by recognition, text recognition results from multiple frames are combined to verify

and enhance text tracking. Few research efforts have been


conducted on this topic. To our knowledge, only one related work [89] exists, in which the edit distance between a recognized word in the current frame and the candidate word in the next frame is regarded as one feature for text
matching. For tracking with detection, objects or detected text
positions are used to track text across consecutive frames. The
methods in this group can be further divided into several subgroups based on the typical tracking strategies [90], i.e., text
tracking with template matching, with particle filtering, and
with tracking-by-detection.5 These methods are summarized
in detail in this section. We also discuss text tracking in the
compressed domain and text tracking with other techniques.
1) Tracking With Template Matching: The template
matching method attempts to answer some variation of the following question: does the image contain a specified view of some feature, and if so, where? As one of the conventional
methods of text tracking, it implements tracking by seeking
the most similar region in the image compared with a template
image (patch). Therefore, feature representation and similarity calculation and (matching) search are the two important factors
for designing a template matching algorithm.
Feature Representation: As described and declared in [92],
the feature extractor is the most important component of
a tracker. Using proper features can dramatically improve
tracking performance. However, exactly what constitutes a
good and effective feature representation for tracking is still an
open problem. A detailed study comparing the performances
of a variety of local features is given in [93]. Here, we discuss
only some of the features used in video text tracking.
Typically, characters belonging to the same text have the
same or similar colors. This property makes color information
an obvious feature for text tracking. Intensity values [94], [95]
and cumulative histograms [96], [97] are typical examples.
Color information is robust in situations of non-rigid image deformation and multi-oriented or multi-scale text. However, it is sensitive to illumination, occlusion, and backgrounds with the same or similar color, which results in undesirable precision. In addition, most videos are stored with
lossy compression, which may cause the text color to bleed.
In another work, projection profile based features are
designed for matching blocks of text [98]. In this method,
an exhaustive search algorithm is used for motion prediction,
where the text regions whose centers fall into a search window
around the center of the reference text region are compared
with the reference text region to find the optimum position of
the text in the next frame. Lyu et al. [21] also used a projection
profiles-based method for multi-frame verification, which can
effectively track text when the camera is zooming in or out.
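As a rough illustration of projection-profile matching with an exhaustive search, the sketch below compares the horizontal and vertical profiles of a reference text block against candidate positions inside a window around its previous center; the window size and the simple absolute-difference cost are assumptions for illustration only, not the settings of [98] or [21].

```python
import numpy as np

def projection_profiles(block):
    """Row-wise and column-wise sums of a grayscale (or binarized) text block."""
    block = block.astype(np.float64)
    return block.sum(axis=1), block.sum(axis=0)

def match_by_profiles(ref_block, frame, prev_center, win=16):
    """Exhaustive search around the previous center: the candidate whose
    projection profiles differ least from the reference is the new position."""
    h, w = ref_block.shape
    ref_h, ref_v = projection_profiles(ref_block)
    cy, cx = prev_center
    best_cost, best_pos = np.inf, prev_center
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            y0, x0 = cy + dy - h // 2, cx + dx - w // 2
            if y0 < 0 or x0 < 0 or y0 + h > frame.shape[0] or x0 + w > frame.shape[1]:
                continue
            cand_h, cand_v = projection_profiles(frame[y0:y0 + h, x0:x0 + w])
            cost = np.abs(ref_h - cand_h).sum() + np.abs(ref_v - cand_v).sum()
            if cost < best_cost:
                best_cost, best_pos = cost, (cy + dy, cx + dx)
    return best_pos
```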
Caption text and scene text are usually designed to be read
easily, thereby resulting in strong edges at the boundaries of
the text and the background. Thus, text contour and edge
features can be extracted and used for tracking. For example,
to track text with complex motions, text contour information
5 In object tracking, a tracking-by-detection strategy always first detects
the targets in a pre-processing step by background subtraction or using
a discriminative classifier, and then estimates the trajectories with these
detection results [91].

is utilized to refine the position [34]. Sobel edge features can


also be extracted and examined to estimate whether the block
contains text components [99].
The Scale Invariant Feature Transform (SIFT) [100] is a method to detect distinctive, invariant image feature points that can efficiently handle problems such as scale change, illumination variation, occlusion, and clutter. As reported in [93],
SIFT performs the best among different types of local features. Therefore, it is widely used in video text tracking. For
example, Na and Wen [101] extracted SIFT features from the
reference text block and a candidate region. To estimate text
motion, the candidate region is made larger than the reference
text block by adding the size of the reference text block.
Phan et al. [102] used a combination of SIFT and Stroke
Width Transform (SWT) to extract many more keypoints and descriptors.
Furthermore, to improve the efficiency of SIFT feature
extraction, the Speeded Up Robust Feature (SURF) [103] was
used as the matching feature for tracking video text in [104]. SURF is a variant of SIFT and offers comparable repeatability, distinctiveness and robustness.
Similarity Calculation and Search: After feature extraction (feature representation), the template is simply generated
according to the represented features. For example, a cumulative histogram of intensity can be used as the template for
matching [96], [97]. The next important step in Tracking with
Template Matching is similarity calculation and (matching)
search, where the key component is the similarity evaluation
criterion that determines which candidate is the best match for
the template.
Li et al. [34] and Li and Doermann [106], [107] proposed
a gradually improved evaluation strategy. They first used the
Sum of Squared Differences (SSD) based image matching to track
text in a pure translational model [105]. The position with
the minimum SSD in a search region will be regarded as the
matched position. In [106], the Mean Square Error (MSE)
was added to measure the dissimilarity because the SSD
based matching always returns a position even when the
matched block does not contain text. The SSD-based module
is a region-based search strategy and has high computational
costs for tracking large text blocks. Thus, multi-resolution
matching based SSD was applied to reduce complexity in [34].
Additionally, Xi et al. employed the same text tracking strategy
for video text extraction [107].
Instead of the Least Mean Square Error (LMSE),
Huang et al. used the Sum of Absolute Differences (SAD)
algorithm to reduce the number of square operations and
compute motion vectors [99]. Similarly, to compare two cumulative histograms, the distance between them is defined as the
SAD in one of the RGB color channels [96], [97].
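The SSD and SAD criteria described above reduce to a simple sliding comparison over a search window. The sketch below is a minimal illustration under stated assumptions (single-channel patches, pure translation, a fixed search radius), not a reimplementation of any particular cited tracker.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two equally sized grayscale patches."""
    d = a.astype(np.float32) - b.astype(np.float32)
    return float((d * d).sum())

def sad(a, b):
    """Sum of absolute differences: cheaper than SSD (no squaring)."""
    return float(np.abs(a.astype(np.float32) - b.astype(np.float32)).sum())

def track_by_template(template, frame, prev_top_left, search=12, cost_fn=ssd):
    """Pure-translation template matching: scan a small window around the
    previous location and return the position with the minimum cost."""
    h, w = template.shape
    py, px = prev_top_left
    best_cost, best_pos = np.inf, prev_top_left
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = py + dy, px + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue
            cost = cost_fn(template, frame[y:y + h, x:x + w])
            if cost < best_cost:
                best_cost, best_pos = cost, (y, x)
    return best_pos, best_cost
```

Passing `cost_fn=sad` gives the SAD variant, which trades the squaring operations for absolute differences.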
Fragoso et al. [108] developed a mobile Augmented
Reality (AR) translation system where the tracking system
is based on an Efficient Second-order Minimization (ESM)
method [109]. This method has a higher convergence rate than
other minimization techniques and avoids the local minima
problem (its solution is usually close to the global minimum). This method tracks
text regions by iteratively minimizing the difference between
a reference frame (the template) and the current frame.

In conventional methods, the kd-trees and the nearest


neighbor (NN) algorithm are used to match SIFT features
between images, and the RANSAC algorithm [110] is used
to compute geometrical transforms from feature matches.
Na and Wen [101] presented a multilingual video text tracking algorithm based on SIFT features [100] and geometric
constraints. In this method, SIFT features are first extracted
from the reference text block and the candidate region. Then,
the nearest neighbor (NN) algorithm was used to match the
keypoints between the two sets of points. Meanwhile, a global
matching method using geometric constraints was proposed to
reduce false matching and improve the accuracy of tracking.
Based on correct matching, the motion of the text block is
estimated across consecutive frames and a match score of text
block is calculated to determine the frame where text appears
or disappears. This method is robust for tracking caption text
and scene text with complex backgrounds and illumination
changes but at the cost of a rather high computational complexity. Similarly, Phan et al. [102] proposed a SIFT-based
text tracking method. The main differences between these two
methods are as follows. First, Phan et al. used a combination
of SIFT and SWT to extract more keypoints and descriptors.
Second, both RANSAC and NN algorithms were used to
obtain the homography between the two sets of descriptors.
Finally, the ending frame of a text block was determined based
on the number of matched descriptors.
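The pipeline of keypoint extraction, nearest-neighbor matching, and RANSAC-based geometric verification can be sketched with standard OpenCV calls as follows; the ratio-test threshold and reprojection error are common defaults and are assumptions here, not the parameters used in [101] or [102].

```python
import cv2
import numpy as np

def sift_match_text_block(ref_block, candidate_region):
    """Match SIFT keypoints between a reference text block and a larger candidate
    region, then estimate the block motion with a RANSAC homography (a sketch)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_block, None)
    kp2, des2 = sift.detectAndCompute(candidate_region, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in (p for p in pairs if len(p) == 2)
            if m.distance < 0.75 * n.distance]        # Lowe's ratio test
    if len(good) < 4:
        return None                                   # not enough matches to continue the track
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # maps reference-block coordinates into the candidate region
```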
The above SIFT matching strategies can also be applied to
SURF matching. Yusufu et al. [104] used SURF features and
a fast approximate nearest-neighbor search algorithm to track
static and rigid moving text objects. For static text, the ending
frame is determined by a change in the number of SURF
feature points in corresponding regions across consecutive
frames. For moving text, a fast approximate nearest-neighbor
search algorithm [111] was used to match keypoints between
the two sets of points. Moreover, instead of the RANSAC
algorithm, a fast histogram based algorithm was designed to
eliminate false matching. Text motion estimation is based on
correct matching across adjacent frames. Thus, the bounding
box position is also utilized to determine the ending frame.
This method is viable and may have better performance than
SIFT features in video text tracking because it enjoys not only
a low time complexity but also robustness with respect to scene
changes. However, it does not work well for text with arbitrary
motion due to camera or object movement.
More recently, Gomez and Karatzas [112] described an
MSER-based real-time text tracking method built upon the
framework proposed by Donoser and Bischof [113] but different because it considers the specificities of text regions. This
MSER-based text tracking method uses invariant moments
as features to find correspondences and considers groups of
regions (text lines) instead of a single MSER. Moreover,
a RANSAC algorithm is used to detect mismatches. The best
matching region is obtained in a window surrounding the
previous location, which is obviously faster than one computed
using the entire image. The method can address rotation,
translation, scaling, and perspective deformations of detected
text. Its main limitation is that the tracking degrades in the
presence of severe motion blur or strong illumination changes.

In summary, template matching based text tracking methods obtain fairly good performance in situations with complex backgrounds, where free text objects are in motion, and even in low resolution frames, owing to proper feature representation (e.g., the advanced feature extraction techniques SIFT, SURF, SWT and MSER) and suitable similarity calculation and search (e.g., the matching techniques described above).
2) Tracking With Particle Filtering: Particle filtering, also known as the Sequential Monte Carlo method, is a nonlinear filtering technique that recursively estimates a system's state based on available observations. It can handle cases in which the state variables do not follow a Gaussian distribution. Particle
filters are popular in computer vision, especially for object
detection and tracking. Thus, they are widely used in video
text tracking. The main steps in particle filtering include
feature extraction, the observation model and the sampling
scheme. Here, we summarize video text tracking techniques
with particle filtering into these subgroups.
Feature Extraction: As mentioned earlier, the feature extractor is very important for a tracker. Although particle filtering does not require a specific object representation, we
give priority to some robust features that have achieved the
most improved tracking performance. In video text tracking, visual features that have been used include cumulative histograms [114], [115], projection profiles [116], [117], and histograms of oriented gradients (HOG) [118].
Observation Model: In particle filtering, the conditional state density p(X_t | Z_t) at time t is represented by a set of samples {s_t^{(n)} : n = 1, ..., N} (particles) with weights \pi_t^{(n)} (the sampling probability). The weight defines the importance of a sample, i.e., its observation frequency. In [114] and [115], if a particle falls into a text block, its weight is set to the similarity value calculated between the previous and the current text blocks; otherwise, it is assigned zero. In these methods, s_{1,2} = 1 / (d_{1,2} + \epsilon) is used to calculate the similarity between two text blocks (Text 1 and Text 2), where \epsilon is a small value to avoid divergence, and d_{1,2} denotes the distance between two cumulative histograms in one of the RGB color channels c [97], namely,

d_{1,2} = \sum_{z=0}^{255} |H_{1,c}(z) - H_{2,c}(z)|.

Moreover, the projected feature positions and the actual text components found in the frames can also be utilized to calculate the weight of each particle [116], [117]. Meanwhile, the Bhattacharyya similarity coefficient of the HOG descriptors is used to define the observation model [118].
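Under the observation model of [114] and [115], a particle's weight is the cumulative-histogram similarity between the tracked text block and the block under the particle. A minimal sketch of this weight computation is given below; the use of a single pre-selected channel and the epsilon value are assumptions for illustration.

```python
import numpy as np

def cumulative_histogram(channel_patch, bins=256):
    """Cumulative intensity histogram of one color channel of a text patch."""
    hist, _ = np.histogram(channel_patch, bins=bins, range=(0, 256))
    return np.cumsum(hist).astype(np.float64)

def particle_weight(ref_patch_c, cand_patch_c, eps=1e-6):
    """Weight of a particle that falls inside a text block: the SAD between the
    two cumulative histograms is d, and the similarity is s = 1 / (d + eps).
    Particles outside any text block would simply receive weight 0."""
    d = np.abs(cumulative_histogram(ref_patch_c)
               - cumulative_histogram(cand_patch_c)).sum()
    return 1.0 / (d + eps)
```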
Sampling Scheme: The new samples at time t can be drawn from S_{t-1} = {(s_{t-1}^{(n)}, \pi_{t-1}^{(n)}, c_{t-1}^{(n)}) : n = 1, ..., N} at the previous time step t-1 based on different sampling schemes.
For example, particles may be scattered around the predicted
center point of the text region in the current frame from the
center point in the previous frame [114], [115]. Similarly,
Merino et al. designed a uniform and Gaussian random
walks strategy around an uncertainty window of the predicted
position for sampling [116], [117] due to the unpredictable
nature of erratic movements.

3) Tracking With Tracking-by-Detection: The tracking-by-detection method associates detection results across successive frames to form object trajectories, namely, it estimates the tracking trajectories using text detection results. Compared
to other tracking methods, tracking-by-detection successfully
solves the re-initialization problem even when the object is
accidentally lost in some frames. It also avoids excessive
model drift due to the similar appearances of different objects.
Here, we discuss tracking-by-detection methods in order of the features used for region matching (matching regions from text detection), moving from simple to complicated, e.g., location overlap, edge maps, character strokes, Harris corners, and MSERs.
Wolf et al. [119] described a simple matching strategy for
tracking-by-detection that makes use of the overlap information between the list of text bounding boxes detected in
the previous and current frames to associate the same text.
To further reduce false alarms, the length of the appearance
is used as a measure of stability. However, because overlap
information is used, the method is not suitable for handling
text in motion.
Similarly, Mi et al. [120] used text region location similarity
and edge map similarity to determine the starting and ending
frames of frame sequences containing the same text object.
Region location similarity is measured by the overlap of two text regions in different frames. Only when both similarities are greater than a specified threshold value are the text regions considered the same; otherwise, they are considered different text regions.
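The overlap test used by these tracking-by-detection methods amounts to an intersection-over-union (or overlap-ratio) check between boxes in adjacent frames; a minimal sketch, with an illustrative threshold, is shown below.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two text bounding boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union > 0 else 0.0

def same_text(box_prev, box_cur, threshold=0.5):
    """Link two detections in adjacent frames when their overlap exceeds a
    threshold (the threshold value here is illustrative)."""
    return iou(box_prev, box_cur) >= threshold
```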
In addition to overlap information, text polarity and character strokes are also used to verify whether two text blocks are identical in consecutive frames [121]. If the two text blocks
possess the same text polarity and character strokes, and one
text block overlaps the other to a sufficient extent, then they
are considered the same text. The system's performance
may be enhanced by increasing the usage of text polarity and
character strokes. However, this method only addresses static
text effectively.
Petter et al. [20] extended the AR text translator of
Fragoso et al. [108] with automatic text detection by using
connected component analysis of the Canny edge detector outputs. However, these two systems differ in their handling of the
temporal redundancy. The system of Petter et al. [20] detects
text in each frame and matches the detected text blocks between the previous and current frames based on their areas and centroids. In contrast, the tracking method
used by Fragoso et al. [108] is more robust and effective with
the ESM algorithm.
Zhen and Zhiqiang [122] presented a text tracking method
for static text by fusing detection results. Their method first
extracts the Harris corner features of text and uses them to
search for the corresponding position in the current frame.
Then, the Hausdorff distance is applied to measure the dissimilarity. When the Hausdorff distance between a text block
and the reference text block is less than a given threshold, the
text block is considered to be a tracked (same) text block.
Liu and Wang [123] proposed a robust method for
extracting captions in video based on stroke-like edges and

spatial-temporal analysis. First, a stroke-like edge filter is


used to detect the caption regions in each frame. Corners and
stroke-like edges are then combined to identify the candidate
text regions. Finally, the distributions of stroke-like edges
between adjacent frames are compared and used for text
tracking.
Wang et al. [124] also used a combination of Canny
edge and Harris corner distributions for static and rolling
text localization and tracking. Canny edge and Harris corner
distributions of text blocks in different frames are calculated
and compared to determine text block matches and then, the
text motion type is decided by analyzing the variation in
the location statistics of matched text blocks. For static text,
a binary search algorithm is utilized to determine the starting
and ending frames of every matched text block, while for
scrolling text, the text scrolling direction and velocity can be
determined by some statistical analysis. However, the method
cannot accurately determine the positions of the starting and
ending frames of text blocks.
Recently, Nguyen et al. [89] made use of temporal redundancy to improve detection and recognition performance. The
overlap ratio between the text block in the current frame and
text blocks in the preceding and following N frames is first
used to remove false positives. Then, three features are taken
into account in the tracking-by-detection process, namely, the
overlap ratio, the temporal distance measured by the number
of frames between the current text block and the candidate
in the prior frames, and the edit distance between the current
word and the candidate word. A linear classifier is applied
to determine whether the current text block matches with a
similar block in previous frames. Finally, both the word scores
and the text block matches are linearly interpolated to recover
missing detection results.
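The three features used in [89] (overlap ratio, temporal distance, and edit distance between recognized words) can be combined by a linear classifier as sketched below; the weights and bias are placeholders, not the values learned in that work, and the overlap ratio is assumed to be precomputed.

```python
def edit_distance(a, b):
    """Levenshtein distance between two recognized words."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def match_score(overlap_ratio, frame_gap, word_cur, word_cand,
                weights=(2.0, -0.1, -0.5), bias=0.0):
    """Linear score over (overlap ratio, temporal distance, edit distance);
    a positive score would mean the two text blocks belong to the same track."""
    features = (overlap_ratio, float(frame_gap),
                float(edit_distance(word_cur, word_cand)))
    return sum(w * f for w, f in zip(weights, features)) + bias
```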
Around the same time, Rong et al. [125] tracked scene text
using tracking-by-detection. The location of each scene text
character (STC) is determined by an MSER detector and used
as a constraint to optimize the trajectory search. The optimized
trajectory estimation is also used to guide text detection in
subsequent frames and reduce the impact of motion blur.
The two processes are performed iteratively to track STC
bounding boxes.
4) Tracking in the Compressed Domain: Because most
videos are stored, processed and transmitted in the compressed format, several proposed text tracking methods operate
directly, quickly and effectively in the compressed domain.
The Discrete Cosine Transform (DCT) coefficients and motion
vectors in compressed formats are useful in the text tracking
process. Here, we will mainly review text tracking methods in
the compressed domain from the last ten years; for a survey of the older literature, readers are referred to [9].
The general text tracking strategy in the compressed domain
is to use motion vectors in compressed streams to assess the
motion similarity of the text macroblocks to track detected
text [126]–[129]. Specifically, all the macroblocks that match
the current text block can be utilized, and the vector is
identified by the greatest number of macroblocks [129]. The
original bounding box is then moved by the amount indicated
by the vector. Finally, a least-square-error search of a small

neighborhood is performed to precisely locate the matching


pixels.
Similarly, Crandall et al. [130] exploited DCT coefficients
to detect text and designed two methods for text tracking.
The first method makes use of MPEG motion vectors to track
rigid text whose font, size or color does not change over time.
To refine the tracking results, an edge pixel based search
algorithm is used on a small neighborhood of the tracked
text block. The second method uses a tracking-by-detection
strategy to address text that changes in font, size or color over
time. This method also employs an edge pixel based matching
algorithm to determine whether two text blocks contain the
same text.
Gllavata et al. [131] proposed a method to track text within
a group of pictures (GOP) in which MPEG motion vector
information extracted directly from the compressed video
stream is utilized to predict the position of text in the next
frame. This method makes efficient use of computation time.
Text macroblocks are first obtained in B and P frames that
intersect at more than a predefined ratio with a given text box.
Then, forward motion vectors are extracted from these text
macroblocks. Moreover, the method obtains a mode motion
vector (MV) from the motion vector set. Finally, the new position of the given text block is calculated by the MV. In contrast
to the method in [130], this proposed method analyzes fewer
macroblocks, avoids a compute-intensive clustering process
and computes motion vectors only when considering and
describing the text region and its motion. However, this
method is not applicable for tracking thin text whose area
of intersection with the background is small.
Qian et al. [13] presented a rolling text tracking method
with DCT texture intensity projection profiles and a static text
line tracking method with direct current (DC) image analysis.
For rolling text tracking, a shift matching strategy is used
to find the best matching position, and the matching error is
applied to determine whether the two text lines are the same
or not. The rolling speed is utilized to determine the frames
in which a rolling text line appears and disappears. For static
text line tracking, the fast algorithm of Yeo and Liu [132] is adopted: it first extracts the DC images of the corresponding text line
regions. Then, the mean absolute difference (MAD) of the DC
images is used to determine the disappearing frame between
the consecutive frames. Later, Jiang et al. proposed another
fast text tracking algorithm [133] that is actually an enhanced
version of [13]. Instead of the shift matching method, a Line
Against Line Matching (LALM) based method determines
whether detected text blocks in consecutive frames are similar.
This method conducts DC-based line matching to determine
the starting and ending frames for each text. Experiments show
that the proposed method has low time complexity and is
effective for both static text and rolling text that has a constant
speed.
To improve the robustness of text tracking, key text
points (KTPs) are introduced in [42]. A KTP is defined as the
point that has a strong texture structure in multiple orientations
simultaneously, and is extracted from three high-frequency sub-bands obtained by a wavelet transform. An acceleration
technique [133] is utilized to quickly track and extract text.

Moreover, a similarity measure, namely, the MAD at the


KTPs (MADKTP), is proposed to reduce the influences of
background variation. The number of KTPs in the regions is
calculated to remove background interference and to extract
text for text segmentation.
5) Tracking With Other Techniques: Because text tracking
is a special case of object tracking, in addition to the methods
described above, several other related object-tracking strategies have also been introduced for video text tracking such as
CAMSHIFT and optical flow.
Kim et al. [36] proposed a texture based text detection
method that combines support vector machines (SVMs) and
continuously adaptive mean shift (CAMSHIFT) [134]. This
method first uses a support vector machine to analyze the textural properties of text in images, and then applies CAMSHIFT
to locate and track the text regions. The proposed method is
robust and works well for complex and textured backgrounds.
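A generic OpenCV sketch of CAMSHIFT tracking on a hue back-projection is given below as an illustration of the second stage of [36]; the SVM-based texture verification of that method is omitted, and the histogram setup follows the standard OpenCV recipe rather than the paper.

```python
import cv2

def camshift_track(frames_bgr, init_box):
    """Track a detected text region with CAMSHIFT on a hue back-projection."""
    x, y, w, h = init_box
    hsv_roi = cv2.cvtColor(frames_bgr[0][y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    window = (x, y, w, h)
    tracked = []
    for frame in frames_bgr[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        _, window = cv2.CamShift(back_proj, window, criteria)
        tracked.append(window)  # (x, y, w, h) of the tracked text region
    return tracked
```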
Myers and Burns [135] proposed a robust tracking method
for scene text with arbitrary 3D rigid motion and scale
changes. Key points are first selected using texture and coverage. Then, normalized correlations of a small image patch
centered at the current point position are applied to track
and locate the selected points. The new point position is
located to sub-pixel precision by quadratic interpolation of
the correlation surface. Point tracking is terminated when the
correlation is less than a given threshold. Finally, a combination of RANSAC and multi-frame reconstruction is used to
estimate the transformation of the whole region in blocks of
multiple frames. This method is verified to be effective for
tracking text in low resolution and noisy videos.
Zhao et al. [136] used optical flow as motion features for
detecting moving captions. Along with the multi-resolution
Lucas-Kanade algorithm, this method uses optical flow estimation to compute an approximation to the motion field from
the intensity differences of two consecutive frames. A decision
tree is then used to classify and detect the moving captions,
where the total area of text moving in the main direction is
selected as the main feature.
Recently, Mosleh et al. [137] used a tracking and motion
analysis scheme to separate overlaid text (caption text) from
scene text in video. This method first uses the CAMSHIFT
algorithm on each detected text to infer the text motion. Then,
the Lucas-Kanade optical flow computation algorithm is used
to estimate both the local motion field of each text object and
the global motion field of the video. When the local motion
dissimilarity with the global motion exceeds a given threshold
and the local motion is static, horizontal or vertical, the text
object is considered to be overlaid text.
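To illustrate how Lucas-Kanade optical flow yields a motion estimate for a text region, the sketch below tracks corner points inside the region with OpenCV's pyramidal LK implementation and takes the median displacement as the local motion; this is a simplification of the motion-field analyses in [136] and [137].

```python
import cv2
import numpy as np

def text_region_motion(prev_gray, cur_gray, text_box):
    """Median displacement of corner points inside a text box between two frames."""
    x, y, w, h = text_box
    mask = np.zeros_like(prev_gray)
    mask[y:y + h, x:x + w] = 255
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01,
                                  minDistance=5, mask=mask)
    if pts is None:
        return 0.0, 0.0
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    ok = status.flatten() == 1
    if not ok.any():
        return 0.0, 0.0
    flow = (new_pts[ok] - pts[ok]).reshape(-1, 2)
    dx, dy = np.median(flow, axis=0)
    # Comparing this local motion against the global frame motion helps separate
    # overlaid caption text from scene text, as discussed above.
    return float(dx), float(dy)
```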
More recently, Merino-Gracia and Mirmehdi [138]
presented a complete end-to-end scene text reading system.
For text tracking, rather than using the particle filter method
from their previous work [116], they used the unscented
Kalman filter (UKF) [139] to track scene text. A constant
velocity model is used for the prediction stage of the filter,
and normalized cross correlation (NCC) is used to perform
text matching. This method focuses on large text found
in outdoor environments, maximizes the use of available
processing power, and operates in real time.

B. Tracking Based Detection


Conventional methods for video text detection mainly focus
on detecting text in each individual frame or in some key
frames. However, these methods cannot obtain high detection
accuracy due to complex backgrounds, poor contrast, and
degraded text quality caused by lossy compression. It is worth
noting that a key characteristic of video text is temporal
redundancy. Thus text tracking techniques are introduced in
the detection process to reduce false alarms and improve
the accuracy of detection. We call these strategies tracking
based detection methods. In general, these methods can be
categorized into temporal-spatial information based methods
and fusion based methods. The former methods directly use
temporal or spatial information to remove noise. The latter
methods merge detection and tracking results over multiple
frames, or combine detection and tracking results in a single
frame to improve the detection accuracy.6
1) Temporal-Spatial Information Based Methods: The temporal and spatial information is directly used to reduce false
alarms in video text extraction, such as the duration of the text,
i.e., the interval between the starting frame and the ending
frame of the same text.
In [98], text regions that persist for less than a second
or have a drop-out rate of more than 25% are discarded.
The interval between the starting frame and the ending frame
of the same text can be considered as the length of the
text trajectory. This trajectory is utilized to decide whether
the tracked text is effective [95]–[97], [107], [114], [115], [121], [122]. The text trajectory is accepted as a valid text trajectory only if its length exceeds a given threshold value; otherwise, it is regarded as noise and discarded. For example,
a region that has been detected or tracked should be selected
in at least 10 individual frames [122] or in at least 3 individual
frames [96], [97], [107], [114], [115]. A similar strategy [95]
is to discard tracked text that is not continuous over the last
6 frames or whose forecasted location lies outside the frame.
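These temporal heuristics boil down to discarding short or unstable trajectories; a minimal sketch follows, where the minimum-length and drop-out thresholds are placeholders chosen within the range of values reported above (3-10 frames, a 25% drop-out rate), not the settings of any single method.

```python
def filter_trajectories(trajectories, min_length=3, max_dropout=0.25):
    """Keep only text trajectories that persist long enough and do not drop out too often.

    Each trajectory is a list of (frame_index, bbox) observations, possibly with gaps.
    """
    valid = []
    for traj in trajectories:
        if len(traj) < min_length:
            continue                            # too short: treat as noise
        span = traj[-1][0] - traj[0][0] + 1     # frames between first and last observation
        dropout = 1.0 - len(traj) / float(span)
        if dropout <= max_dropout:
            valid.append(traj)
    return valid
```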
2) Fusion Based Methods: Fusion based methods can further be categorized into two groups: techniques that employ multiple frame integration (MFI), and techniques that integrate detection and tracking results in a single frame (detection-tracking fusion methods).
MFI techniques, such as multi-frame averaging [124] and time-based minimum/maximum pixel value searching [120], are employed to reduce the influence of complex backgrounds.
In these methods, the detection result is used as a constraint to
optimize the trajectory search. Optimized trajectory estimation
is then fed back to guide text detection within a subsequence.
Thus, the integration of detection and tracking results can
reduce the effect of motion blur and improve the accuracy
of detection.
Specifically, for a text trajectory, Wang et al. [124] performed average integration on the base frame and its 10 preceding and 10 subsequent frames (21 frames in total). Then, they
detected text in the integrated frame again. The detected text
6 Here, fusion-based methods focus on different fusing strategies to combine
detection and recognition, or detection and tracking results. Obviously, they
are based on temporal-spatial characteristics of text.

is the final result of text detection. In contrast, Mi et al. [120]


selected 30 consecutive frames between the starting frame and the ending frame that contain the same text. Assuming that captions have high-intensity values, they
obtained the minimum value of the corresponding pixels from
consecutive frames to form the integrated frame.
For fusion based methods that combine detection and tracking results in a single frame (detection-tracking fusion methods), a detected text region in an individual frame is integrated with those previously tracked, and the output is used to suppress false positives [112], [118]. In [118], the position, size
and color histogram information of a detected and tracked
text region are computed, and matching is performed by the
Hungarian algorithm. Tracked text regions that are matched or
have a positive score are selected, and the remaining regions
are discarded. In [112], the tracked text line is updated by
newly detected MSERs, which thus regenerates the tracking
process. Moreover, this method can restart tracking of regions
that are lost during a previous tracking process.
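A simple way to realize such detection-tracking fusion is to build a cost matrix between newly detected boxes and tracked boxes and solve the assignment with the Hungarian algorithm; the sketch below uses 1 - IoU as the cost, which is a simplification of the position, size, and color-histogram features matched in [118].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union > 0 else 0.0

def fuse_detections_with_tracks(det_boxes, trk_boxes, max_cost=0.7):
    """Hungarian matching on a (1 - IoU) cost matrix; unmatched detections start new tracks."""
    if not det_boxes or not trk_boxes:
        return [], list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(d, t) for t in trk_boxes] for d in det_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_dets = {r for r, _ in matches}
    unmatched_dets = [i for i in range(len(det_boxes)) if i not in matched_dets]
    return matches, unmatched_dets
```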
Recently, Zuo et al. [140] proposed a multi-strategy tracking based text detection method for scene videos, where tracking-by-detection, spatial-temporal context learning, and linear prediction are all performed to sequentially predict candidate text locations, and the best matching text block is adaptively selected from the candidates using a rule-based method.
Tian et al. [141] proposed a unified tracking based scene
text detection system by learning locally and globally, which
uniformly integrates detection, tracking and their interactions
with dynamic programming.
C. Tracking Based Recognition
In the literature, most proposed methods for text recognition
mainly focus on recognizing text in a single image (frame).
OCR engines and other commercial recognition software have
been widely used for recognizing text in images. However,
they cannot always achieve good performance in each individual video frame due to complex backgrounds, low contrast,
and poor resolution. Fortunately, because tracking techniques
have been introduced in video text extraction methods and
the same text usually has the same ID label in tracking
results, MFI methods can make use of temporal redundancy
to obtain a frame with a cleaner background, higher contrast,
and better resolution. This is a fast and effective way to
improve the accuracy of text recognition. In general, MFI
techniques for video text recognition can be divided into two
major categories: image enhancement, which integrates the
same text region images (patches) to obtain a high-resolution
image through techniques such as multi-frame averaging, time-based minimum/maximum pixel value searching, and so on,
and recognition results fusion, which combines recognition
results from different frames into a final text string.
1) Image Enhancement: Image enhancement technology
uses selection-based or integration-based strategies to obtain
a high-resolution image from the text regions that contain the same text. The selection-based methods choose the
text region with the highest resolution from the text regions in
consecutive frames. The integration-based methods, in contrast, obtain a high-resolution image through image fusion,

which mainly includes multi-frame averaging, time-based minimum/maximum pixel value searching, and Boolean And
operations.
To achieve better results, it is necessary to select the
most appropriate text regions for character recognition. We call
techniques with this simple and basic strategy selection-based
methods. In [96], [97], [114], the region with the longest
horizontal length is selected as the most appropriate region
because characters in this region are usually the biggest.
Furthermore, Goto and Tanaka [115] changed the selection
algorithm to avoid a delay in message presentation by taking
six features into account: text region area and width, Fisher's
discriminant ratio, the number of vertical edges, the sum of
the absolute values of the vertical components, and the vertical
edge intensity. If a feature value is at a local maximum and
is the highest among the past peaks in a chain as calculated
every two seconds, the text region will be passed to the next
process.
Obviously, the selection-based methods have some limitations regarding blurred text. In contrast, in such situations, the
integration-based methods may obtain better results. The most
common strategy is multi-frame averaging, where tracked text
regions with the same ID label are averaged to obtain a new
text region. The integration-based methods filter out complex
local backgrounds and improve text quality, thereby increasing
the recognition accuracy. Multi-frame averaging is used in
many tracking based text recognition techniques [42], [102],
[106], [107], [119], [122], [124], [142], [143]. Specifically, the
average operation can be applied to produce an average luminance image [42]. Xi et al. [107] used multi-frame averaging
to obtain a new intensity image after tracking a text block over
5 consecutive frames, where each text block is first enlarged
before performing the averaging operation. Phan et al. [102]
first employed a multi-frame averaging technique in mask
areas corresponding to original text regions. Then, a simple thresholding was applied to the average intensity image
to obtain an initial binarization of the word area. Finally,
the shapes of the characters are refined using the intensity
values.
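The following minimal Python sketch, using OpenCV and NumPy, illustrates the basic multi-frame averaging idea; resizing to a common size and Otsu thresholding are simplifications of ours, not the exact steps of the methods cited above.

```python
import cv2
import numpy as np

def average_tracked_regions(patches, size=(128, 32)):
    """Pixel-wise average of grayscale text patches that share one tracking
    ID. Resizing to a common size is a simplification; the surveyed methods
    enlarge or interpolate the original regions instead."""
    stack = np.stack([cv2.resize(p, size).astype(np.float32) for p in patches])
    return stack.mean(axis=0).astype(np.uint8)

def binarize_average(avg_patch):
    """Global Otsu thresholding of the averaged patch, standing in for the
    simple thresholding applied to the average intensity image."""
    _, binary = cv2.threshold(avg_patch, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```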
Similarly, Hua et al. [143] first selected some high contrast
frames (HCFs) by calculating the percentage of dark pixels
around text blocks to apply averaging on them. To solve
problems when only part of the text is readable or clear in HCFs,
a text area decomposition method [144] was used to divide
a text block into words on the averaging frame of the HCFs.
They searched for the high contrast text blocks (HCBs) and
averaged all the corresponding HCBs to obtain a clearer block.
Finally, theose clearer blocks are merged to form a whole text
region.
Some other multi-frame averaging methods for image
enhancement exist. For example, a combination of image
interpolation and multi-frame averaging can be applied to
enhance the image. By first using bilinear interpolation to
improve the resolution of each text region, a new text region
can be obtained by averaging 30 text regions [106] or all the
interpolated images [119]. In Wang and Wei's method [122],
a text region is first segmented into smaller blocks. The
corresponding clearer blocks are selected by comparing the
number of corners in a sliding window with the average corner
number of the text region. Then, these selected clearer blocks
are averaged to form the entire text region. Finally, linear
interpolation is used to improve the resolution of the averaged text regions before converting them to a binary image.
Li and Doermann [142] also used multi-frame averaging for
caption text region enhancement, and they used POCS (projection onto convex sets) for scene text. In POCS enhancement,
each text block is first bilinearly interpolated to a required
image grid in POCS. Then, the residual is calculated and
back-projected. In this way, the residual is iteratively
computed until a stopping criterion is satisfied.
The main issue with multi-frame averaging is that the
averaging process may blur the character edges and cause
low contrast because the selected text regions may contain blurred and unclear text. In such cases, the multi-frame averaging results are challenging for word recognition.
Correspondingly, some other integration strategies, e.g., time-based minimum/maximum pixel value searching, have been
investigated to improve text recognition.
Generally speaking, pixels that represent text in video vary
only slightly, while background pixels often change radically
over time. Based on this observation, a time-based minimum/maximum pixel value search method can strengthen the
contrast between text and its background [94], [98], [121],
[145], [146]. Specifically, Zhou et al. [121] adopted a minimum/maximum operation with text block polarity. A group
of identical text blocks that possess the same text polarity at a coordinate point is combined using the minimum operation;
otherwise, the maximum operation is utilized. Unfortunately,
a time-based minimum/maximum pixel value search method
is easily affected by noise, especially when there are few
background changes or when the background color is similar
to the text stroke color.
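A minimal sketch of time-based minimum/maximum pixel value searching is given below; the polarity flag and the strictly per-pixel min/max fusion are simplifications of the surveyed approaches.

```python
import numpy as np

def temporal_min_max(patches, bright_text=True):
    """Time-based minimum/maximum pixel value search over aligned grayscale
    patches with the same tracking ID. For text brighter than its background,
    the per-pixel minimum leaves the consistently bright strokes intact while
    driving the fluctuating background towards its darkest value; for dark
    text the maximum is used instead."""
    stack = np.stack([p.astype(np.float32) for p in patches])
    fused = stack.min(axis=0) if bright_text else stack.max(axis=0)
    return fused.astype(np.uint8)
```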
To address this problem, Mi et al. [120] proposed a multi-frame edge integration method. Edge integration is performed over 30 consecutive text regions. A Canny edge detector is first applied to generate an edge map for each text region. Then, the probability of being part of an edge is computed for each pixel to construct an
Edge Distribution Histogram (EDH). A stroke mark map is
then obtained from the EDH by thresholding. Dilation and
erosion algorithms are then applied to the stroke mark map to
construct a final stroke mark map.
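The sketch below illustrates this kind of multi-frame edge integration; the Canny thresholds, the voting ratio and the 3x3 structuring element are illustrative choices rather than the parameters of the original method.

```python
import cv2
import numpy as np

def edge_integration(patches, keep_ratio=0.5):
    """Multi-frame edge integration over aligned grayscale patches of equal
    size. A Canny edge map is computed per frame, the per-pixel frequency of
    lying on an edge is accumulated (an edge distribution), pixels that are
    edges in at least `keep_ratio` of the frames form the stroke mark map,
    and a dilation followed by an erosion closes small gaps."""
    h, w = patches[0].shape[:2]
    votes = np.zeros((h, w), dtype=np.float32)
    for p in patches:
        votes += (cv2.Canny(p, 50, 150) > 0).astype(np.float32)
    mark = ((votes / len(patches)) >= keep_ratio).astype(np.uint8) * 255
    kernel = np.ones((3, 3), np.uint8)
    return cv2.erode(cv2.dilate(mark, kernel), kernel)
```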
Similarly, Yi et al. [147] constructed a text-intensity map
to measure the clarity of the text and then selected the clear
text blocks for integration. They integrated the blocks by using
average and minimum integrations for the text and background
pixels of the image, respectively. Here, Otsu's binarization was
employed to identify text and background pixels.
For a list of tracked text blocks, a Boolean And operation
can enhance the text pixels [104], [123]. In [123], candidate
caption pixels are first extracted from each clip containing the
same caption based on the estimated caption color information.
Then, an And operation is used to filter out inconsistent false
positive pixels.
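A Boolean AND over per-frame candidate masks can be written in a few lines; the sketch below assumes the masks are already aligned and binarized.

```python
import numpy as np

def and_integration(binary_masks):
    """Boolean AND over aligned per-frame candidate text masks for one
    caption: only pixels consistently labelled as text across the whole clip
    survive, which filters out sporadic false positives."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in binary_masks])
    return np.logical_and.reduce(stack, axis=0)
```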
TABLE I: Summary of Text Tracking, Tracking Based Detection and Tracking Based Recognition Methods Applied With Text Category: Caption Text or/and Scene Text

2) Recognition Results Fusion: Recognition results fusion simply combines the text recognition results of different frames into one final character/text, which can generally improve the overall recognition performance.
On the one hand, Mita and Hori [148] proposed a method
to select the best results of individual characters. They first
divided the video into two-second segments and captured
frames in a constant time interval. Text is extracted and
recognized for each captured frame. Then, all the individual
characters are grouped by their position and the size of their
rectangular bounding boxes. In the next step, the certainty
value of each character group is calculated by the number
of characters in the group and that value is used to select
groups. Groups with high certainty values are selected and any
groups that have a large overlap region with selected groups
are rejected. Finally, the character code in each selected group
is determined by evaluating the obtained code number or using
a certain value code (which is a median value in the code
group). This proposed method is based on the assumption that
video text has a single intensity and color, has high contrast,
and is static.
On the other hand, Rong et al. [125] constructed two fusion
methods for video text recognition. The first one employs a
Majority Voting model, which utilizes category labels of scene
text character (STC) prediction in all frames and chooses the
one with the highest frequency as the final result. The second
one uses a CRF model to fuse multi-frame STC prediction
scores under lexical constraints in which the STC prediction
scores are used as node features and the lexical frequency of
neighboring STCs are used as edge features. The experiments
in [125] show that this tracking based recognition technique
is effective. Similarly, Greenhalgh and Mirmehdi designed a
strategy to temporally fuse text results across consecutive
frames [149]. In their method, individual OCR words are
first compared from frame to frame based on size. Then,
the results of the ten most recent detections are combined,
and a histogram of OCR results is created for each tracked

word weighted by the recognition confidence. At each frame,


the histogram result with the highest value determines the
recognized word for that frame.
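Both plain majority voting and the confidence-weighted histogram fusion described above reduce to accumulating a score per word hypothesis and keeping the best one, as in the following illustrative sketch (the interface is hypothetical).

```python
from collections import defaultdict

def fuse_word_hypotheses(hypotheses):
    """Fuse per-frame OCR results for one tracked word.

    `hypotheses` is a list of (word, confidence) pairs from recent frames.
    A confidence-weighted histogram is accumulated and the word with the
    highest total score is returned; with equal confidences this reduces to
    plain majority voting."""
    scores = defaultdict(float)
    for word, confidence in hypotheses:
        scores[word] += confidence
    return max(scores, key=scores.get)

print(fuse_word_hypotheses([("EXIT", 0.9), ("EXIT", 0.7), ("EXII", 0.4)]))
# -> EXIT
```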
D. Summary
In the preceding sections of this paper, text tracking, tracking based text detection, and tracking based text recognition
methods are surveyed within different categories. In each
categorization, related methods are further summarized into
several subgroups, with each subgroup containing the methods
with similar procedures. For example, Section IV-A1, Tracking with Template Matching, describes techniques for feature representation and for similarity calculation and (matching) search. Then, within the feature representation part of tracking with template matching, techniques based on different types of features (e.g., intensity-based features, projection-profile-based features, and advanced features such as SIFT and SURF) are further surveyed.
Now, from a different perspective, we distinguish all the related works (in Sections IV-A, IV-B and IV-C) according to whether they track, detect or recognize caption text or/and actual scene text in video, and summarize these methods in Table I.
Historically, the bulk of the proposed methods are intended
to track, detect and recognize caption text; however, more
recently, an increasing number of techniques have been
designed to address scene text. More details can be found
in Sections IV-A, IV-B and IV-C.
V. E VALUATION AND DATASETS
In this section, we first summarize the acknowledged protocols for evaluating text detection, tracking and recognition algorithms, some of which follow conventions from text detection and recognition in images and some of which derive from object tracking evaluation conventions. We then collect and analyze some benchmark datasets available for testing video text extraction.

A. Evaluation Protocols
The evaluation protocols for text detection, tracking and
recognition in video have been presented in [21], [42], [89],
[101], [121], [124], [130], [133], and [150]–[154]. Here, we
summarize several mainstream evaluation methods.
Detection Evaluation Protocols: There are three criteria for measuring text detection results in video. The first
criterion is speed, which indicates the average processing
time per frame in text detection. The second criterion is precision, which evaluates the percentage of correctly detected text regions among all claimed (detected) text regions, as follows:

\[
\text{Precision} = \frac{\text{number of correctly detected text regions}}{\text{number of detected text regions}}. \tag{1}
\]
The last criterion is recall, which is defined as the ratio of the correctly detected text regions to the ground-truth text regions, calculated as

\[
\text{Recall} = \frac{\text{number of correctly detected text regions}}{\text{number of ground-truth text regions}}. \tag{2}
\]
These detection protocols have been described in detail
in [155].
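For completeness, the two detection criteria of Eqs. (1) and (2) can be computed as in the following trivial sketch; how a detected region is judged "correct" depends on the matching rule of the specific protocol.

```python
def detection_precision_recall(num_correct, num_detected, num_ground_truth):
    """Precision and recall of Eqs. (1) and (2)."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# e.g. 45 correct detections out of 50 detected and 60 ground-truth regions
print(detection_precision_recall(45, 50, 60))  # (0.9, 0.75)
```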
Tracking Evaluation Protocols: In video, the outputs from
text detection and tracking are all the text positions. Thus,
the evaluation protocols can be divided into two categories.
The first is that tracking results are regarded as detection
results and can be evaluated by the detection evaluation
protocols as described above. The second is based on a
mapping list M of ground truth and tracking results correspondences at the frame level. A mapping list Mt maximizes the sum of overlap between the tracking results and
ground truth at frame t. Multiple Object Tracking Precision (MOTP), Multiple Object Tracking Accuracy (MOTA),
and the Average Tracking Accuracy (ATA) are used to evaluate
the performance of tracking, where the notations used are as
follows [152]:
- $G_i$ denotes the $i$th ground-truth text at the sequence level and $G_i^{(t)}$ denotes the $i$th ground-truth text in frame $t$.
- $D_i$ denotes the $i$th tracked text at the sequence level and $D_i^{(t)}$ denotes the $i$th tracked text in frame $t$.
- $N_G$ and $N_D$ denote the number of unique ground-truth text instances and the number of unique tracked text instances in a given sequence, respectively. Uniqueness is defined by text IDs.
- $N_{frames}$ is the number of frames in the sequence, and $N_{frames}^{(G_i \cup D_i \neq \emptyset)}$ is the number of frames in which the ground-truth text $G_i$ or the tracked text $D_i$ exists in the sequence.
- $N_{mapped}$ is the number of mapped ground-truth and tracked text pairs in the sequence, and $N_{mapped}^{(t)}$ is the number of mapped ground-truth and tracked text pairs in frame $t$.

MOTP is defined as the spatial-temporal overlap between the reference trajectory and the system output trajectory,

\[
MOTP = \frac{\sum_{t=1}^{N_{frames}} \sum_{i=1}^{N_{mapped}^{(t)}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{\sum_{t=1}^{N_{frames}} N_{mapped}^{(t)}}. \tag{3}
\]

MOTA is computed from the number of false negatives, false positives and ID switches in the system output trajectory for a given reference ground-truth trajectory,

\[
MOTA = 1 - \frac{\sum_{t=1}^{N_{frames}} (fn_t + fp_t + id\_sw_t)}{\sum_{t=1}^{N_{frames}} N_G^{(t)}}, \tag{4}
\]

where $fn_t$, $fp_t$, $id\_sw_t$ and $N_G^{(t)}$ refer to the number of false negatives, false positives, ID switches, and ground-truth words at frame $t$, respectively.

The Sequence Track Detection Accuracy (STDA) is a measure of the tracking performance over all the text in the sequence and is calculated by means of the sequence-level mapping $M$ as

\[
STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{N_{frames}^{(G_i \cup D_i \neq \emptyset)}}. \tag{5}
\]

The ATA is the normalized STDA per text and is defined as

\[
ATA = \frac{STDA}{(N_G + N_D)/2}. \tag{6}
\]

Recognition Evaluation Protocols: For text recognition


(in both images and video), the recognition performance is
always measured by the accuracy of word recognition. The
word recognition accuracy (WRA) is simply defined as the percentage of words that are correctly recognized, i.e.,

\[
\text{WRA} = \frac{\text{number of words correctly recognized}}{\text{number of ground-truth words}}. \tag{7}
\]
The recent ICDAR 2015 Robust Reading Competition
used multiple-object-tracking based metrics (MOTP, MOTA
and ATA) for end-to-end video text recognition. The evaluation
framework is similar to the tracking evaluation protocols, but it is applied to recognized words rather than detection results, where an estimated
word is considered a true positive if its spatial intersection
over union with a ground truth word is larger than 0.5 and the
word recognition is correct [77].
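To make the above definitions concrete, the following minimal Python sketch computes the metrics of Eqs. (3)-(7) from per-frame matching results; the function and variable names are our own illustrative choices, STDA is assumed to have been accumulated track by track beforehand, and the sketch is not part of any official evaluation toolkit.

```python
import numpy as np

def motp(iou_per_frame):
    """MOTP (Eq. (3)): mean IoU over all mapped ground-truth/tracked pairs,
    where `iou_per_frame` holds one list of IoU values per frame."""
    ious = [v for frame in iou_per_frame for v in frame]
    return float(np.mean(ious)) if ious else 0.0

def mota(fn, fp, id_sw, n_gt):
    """MOTA (Eq. (4)) from per-frame counts of false negatives, false
    positives, ID switches and ground-truth objects."""
    errors = sum(a + b + c for a, b, c in zip(fn, fp, id_sw))
    return 1.0 - errors / float(sum(n_gt))

def ata(stda, n_gt_tracks, n_det_tracks):
    """ATA (Eq. (6)): STDA normalised by the average number of unique
    ground-truth and tracked text instances."""
    return stda / ((n_gt_tracks + n_det_tracks) / 2.0)

def word_recognition_accuracy(n_correct_words, n_ground_truth_words):
    """WRA (Eq. (7))."""
    return n_correct_words / float(n_ground_truth_words)

def end_to_end_true_positive(iou, predicted_word, ground_truth_word):
    """ICDAR 2015 style end-to-end check: a recognised word is a true
    positive only if box IoU with the ground truth exceeds 0.5 and the
    transcription matches (string comparison simplified here)."""
    return iou > 0.5 and predicted_word == ground_truth_word
```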
B. Datasets
Table II lists some commonly used datasets for text detection, tracking, segmentation and recognition in video or video
frames and summarizes their features including text categories,
sources, tasks and languages. Some sample frames from
different videos are shown in Fig. 4.
TABLE II: Benchmark Datasets Available for Video Text Extraction, Where in the Task Column, D, T, S and R Respectively Represent Detection, Tracking, Segmentation and Recognition

Fig. 4. Sample frames of text from the MSRA-I, MoCA, Merino-Gracia, Minetto, ICDAR13 / ICDAR15 datasets. (a) MSRA-I Embedded Caption Text (Graphic Text) Dataset. (b) MoCA Embedded Caption Text (Graphic Text) Dataset. (c) Merino-Gracia Scene Text Dataset. (d) Minetto Scene Text Dataset. (e) ICDAR13 / ICDAR15 Scene Text Dataset.

The MoCA dataset [95] is used for real-time text segmentation and recognition in video. Videos in this dataset are of low resolution, and the graphic text (embedded caption text) in most videos moves at high speed.
The Merino and Mirmehdi [116] and Merino-Gracia
datasets [138] are intended for scene text tracking in video.

The Merino-Gracia dataset contains scenes with text in a


city environment and suffers from hand-held erratic camera
motion, as well as from blur and a large degree of perspective
distortions. Similarly, text in the Merino dataset undergoes occlusions in highly textured scene backgrounds, complex backgrounds, significant perspective changes, and viewpoint changes.

TABLE III: State-of-the-Art Performance on Available Video Text Datasets, Where Precision and Recall Are for Text Detection, MOTP, MOTA and ATA Are for Text Tracking, and D and R Mean That Tracking Is Applied for Measuring Detection and Recognition, Respectively. The Interval Value Is Estimated From the Figure in the Corresponding Reference
The Minetto et al. [118] and YouTube Video Text
datasets [89] are used for text detection, tracking and recognition in video. Some videos have several text regions that
are sometimes affected by natural noise, distortion, blurring,
substantial changes in illumination and occlusion. Specifically,
the YouTube Video Text dataset contains 30 videos collected
from YouTube. The text contents can be further divided into
two categories, graphic text (e.g., captions, song titles, and
logos) and scene text (e.g., street signs, business signs, and
words on t-shirts).
The ICDAR 2013 dataset (Robust Reading Competition
Challenge 3: Text in Videos) [153] is used to evaluate the
performance of video scene text detection, tracking and recognition. This database includes 28 video sequences, of which
13 videos are for training, and the rest are for testing. These
videos cover different scripts and languages (Spanish, French,
English and Japanese) and were captured with different types
of cameras. More recently, the ICDAR 2015 Robust Reading
Competition [77] released an updated version of the ICDAR
2013 video dataset. The ICDAR 2015 dataset includes a
training set of 25 videos (13450 frames in total) and a test
set of 24 videos (14374 frames in total). The dataset was
collected by organizers from different countries and includes
text in different languages. The video sequences correspond
to 7 high-level tasks in both indoor and outdoor scenarios.
Moreover, 4 different cameras are used for capturing different
sequences.
In addition, there are several datasets composed of video
frames. The TREC [156], MSRA-I [150] and SVT [82]
datasets include graphic text (embedded caption text) or scene
text in key video frames. The TREC dataset is mainly used
for video text searching, while the MSRA-I dataset is used
to detect horizontal graphic text and scene text. The SVT
dataset is used for scene text detection and recognition in street
views.
We also summarize the state-of-the-art performance of the
above video datasets (not including datasets with video frames)
in Table III. As shown in Table III, the performance of text
detection, tracking and recognition in complex videos such
as the ICDAR 2015 and the YouTube Video Text datasets is
limited. For example, on the ICDAR 2015 scene text video
dataset, the ATA results for detection and recognition are only
45.18 and 41.84, respectively [77], although there has been steady progress in recent years. All these limitations lead to numerous open research opportunities (see the discussions of challenges
and future directions in Section VII).
VI. A PPLICATIONS
In the last 10 years, as the use of smart phones, intelligent wearable devices and social networks has exploded, numerous applications for video text extraction techniques and systems have emerged. Here, we briefly describe
two major categories of these applications7: video understanding and retrieval, and reading in the wild.
A. Video Understanding and Retrieval
Text in video plays an important role in semantic-based
video analysis, indexing and retrieval [5], [34], [37], [75],
[95], [104], [107], [130], [146], [157]–[164]. A variety of
research efforts have been conducted in this field. Specifically,
Lienhart et al. [160] employed text detection and recognition
to record the broadcast time and date of commercials, and
subsequently help people check whether their clients'
commercials have been broadcast on the scheduled television
channel within a specified time. Zhang and Chang [158]
used superimposed caption detection and recognition to detect
events in videos of baseball games. A methodology based on
text recognition is proposed and used in a video annotation
and indexing system in [75]. Moreover, this methodology has
been integrated into the European Automatic Segmentation
and Semantic Annotation of Sports Videos (ASSAVID) project
to create and search sports video annotations. Overall, the most
obvious application for video understanding and retrieval is to
help us obtain more accurate and richer video information
in the future from the Internet for whatever subjects we are
interested in.
B. Reading in the Wild
The rapid advancement of scene text extraction technology
has promoted the emergence and development of various
practical applications for reading text in the wild such as
assistance for visually impaired people [96], [97], [114], [115],
[165]–[170], real-time translation [20], [108], [171], [172],
user navigation [118], [173], traffic monitoring [174], [175]
and driving assistance systems [176], [177].
7 Other video-related applications are discussed in [1], [2], [9], and [10].

TABLE IV: Challenges of Text Detection, Tracking and Recognition for Scene Text and Embedded Caption in Video
Assisting Visually Impaired People: Real-time text extraction technology in the wild can help visually impaired
people understand scenes in their surrounding environment.
A handheld PDA-based system was developed to help blind
people accomplish daily tasks [167]. The system can be
viewed as a loop that includes the user taking a snapshot,
text/picture detection, optical character recognition, text-tospeech synthesis, and feedback to the user, until it reaches a
useful output. Ezaki et al. [168] described a system equipped
with a PDA, a CCD-camera and a voice synthesizer to
assist visually impaired persons by detecting text objects
from natural scenes and transforming them into voice signals.
Similarly, a guide dog system [165], a portable text reading
system [169], an autonomous robot [96], [97] and a wearable
camera system [114], [115] were all designed and constructed
for visually impaired people.
Real-Time Translation: Text extraction is also important for translation purposes (e.g., for tourists or robots).
Haritaoglu [171] developed an automatic sign/text language
translation system for foreign travelers. Detected text in a
scene can be translated into a traveler's native language by this
system. Shi and Xu [172] presented a wearable translation
robot that can automatically translate multiple languages in
real time. The robot consists of a camera mounted on reading
glasses together with a head-mounted display used as the
output device for the translated text. A mobile augmented
reality (AR) translator on a mobile phone using a smart-phone
camera and touchscreen is described in [20] and [108].
User Navigation: User navigation can pinpoint a user's
position and provide routes to a destination in real time.
Aoki et al. [173] designed a small camera mounted on a baseball cap intended for user navigation in a scenic environment.
A street view navigation system was also constructed in [118].
Traffic Monitoring and Driving Assistance Systems: In general, the most effective way to monitor traffic is to obtain
and track license plates. Cui and Huang [174] proposed
a Markov Random Field (MRF) model-based method to
extract characters from the license plates of moving vehicles.
Park et al. [175] used neural networks to identify car

license plates. Another application of video text recognition is in systems intended to provide driving assistance.
For example, Wu et al. [176], [177] proposed a method that
incrementally detects text on road signs from natural scene
videos applicable for driving assistance systems. Greenhalgh
and Mirmehdi proposed a novel system to automatically
detect and recognize text and symbols on traffic signs using
MSER-based text detection and multi-frame integration based
text recognition [149], [178].
Others: Text extraction technology is also used to read real-world text such as store names, advertisements and so on.
Létourneau et al. [179], [180] developed an autonomous
mobile robot equipped with a Pentium 233 MHz and a
Sony EVI-D30 pan-tilt-zoom camera to read such real-world
text. The robot acquires an image, extracts text regions,
and recognizes them as symbols, characters, and words.
Merino-Gracia et al. [117] designed a mobile head-mounted
device encased in an ordinary flat-cap hat to recognize text in
natural scenes. The main parts of the device are an integrated
camera and audio webcam together with a simple remote
control system, all connected via a USB hub to a laptop.
VII. D ISCUSSIONS
As described in previous surveys and papers [1], [9], [181],
there are numerous challenges for text detection and recognition in images and videos. In this section, we highlight some
prominent challenges related specifically to video. We also
specifically discuss potential directions of video text extraction
technologies for future research from the literature.
A. Challenges
Generally speaking, both detecting and recognizing text in
video have some challenges in common such as robustness
to background complexity, text degradation and distortion,
text variations, and moving objects. Here, we categorize these
challenges into three major groups, i.e., background-related,
foreground-related, and video-related challenges.
First, the various challenges of video text extraction are
summarized in Table IV, which also presents the unique and
shared challenges between scene text and embedded caption
extraction in video.
Second, because previous references from the literature
have broadly discussed the background and foreground-related
challenges for text detection and recognition in images and
videos [1], [6], [10], this survey henceforth emphasizes several
prominent video-related challenges for video text extraction.
These video-related challenges are generally derived from the
following special characteristics of dynamic videos.
Low resolution: Compared with document images, video
frames usually have a lower resolution.
Compressed degradation: Most videos are stored with
lossy compression which may cause text color bleeding.
Low bit-rate video compression can also result in loss of
contrast between text and backgrounds. Moreover, video
compression may create additional artifacts.
Moving text and objects: Text in video sometimes moves
in complex non-linear ways, such as when the camera is
zooming in or out, rotating or when scene text moves
arbitrarily. In addition, moving objects in video typically
exhibit motion blur, and the entire backgrounds of consecutive frames may vary widely.
Real-time processing: In general, video plays at a rate of
approximately 25 or 30 frames per second. Thus, video
text detection, tracking and recognition algorithms need
high computational efficiency. Performing at such speeds
is a serious challenge for most current methods and
systems in the literature.
Finally, compared with general object tracking [90], [182],
text tracking has several specific challenges, because text
objects are quite different from common visual objects in
video. Generally speaking, multi-oriented or multi-scaled text
and text blur make text tracking more difficult than tracking
other objects. Specifically, compared to general visual object
tracking, the difficulty of text tracking is mainly reflected in
three aspects. First, the text background is often complex,
especially for scene text and embedded caption text. Second,
some parts of the background can be very similar to the text
and text objects are usually small, both of which reduce the
accuracy of text detection and tracking. Third, the motion
of text in video is complex and motion prediction is very
challenging. For example, scene text in video almost always exhibits a variety of distortions (e.g., color, scale, skew, perspective, curvature and surface distortions). Some samples are shown in
the ICDAR 2015 Video Scene Text Dataset.
Specifically, we want to emphasize two major technical problems in these challenges for text detection and
recognition in both images and video: distorted text detection and recognition, and multilingual text detection and
recognition.
1) Distorted Text Detection and Recognition: Skew (multi-oriented), curved, perspective and unaligned distortions are almost always found with scene text in images and videos [48], [49], [183] (see Fig. 1(b), 4(c) and 4(e)). These
distortions usually constitute serious challenges for text detection and recognition; most current research focuses mainly on
horizontal or near-horizontal text [68], [72]. One fundamental
difficulty in detecting distorted text is that the text line
alignment feature can no longer be used to regularize the
text/word construction.
Skew Distortion (Multi-Orientation): A large number of
published scene text detection and recognition methods focus
mainly on (near) horizontal text detection in scene images;
only a very few methods have been proposed to detect text
with skew distortions. Some researchers have used coarse-to-fine grouping strategies with various priors to detect multi-orientation text [68], [70], [72]. Others have designed specific orientation-invariant text features and combined SIFT features to detect skewed and curved text [184]. More specifically, Yao et al. proposed a multi-orientation scene text detection system by bottom-up grouping and top-down pruning but
with numerous empirical rules and parameters [70] and then
extended this work to design an end-to-end multi-orientation
scene text recognition system [72]. Kang et al. used higher
order correlation clustering to partition MSERs into text line
candidates and proposed a robust multi-orientation scene text
detection system [66]. To ease the influence of empirical rules
as much as possible, Yin et al. constructed multi-orientation
scene text detection systems with adaptive clustering algorithms [6], [68].
Perspective Distortion: Similarly, there are few existing
research efforts for detecting and recognizing text distorted
by perspective, although scene text in images and videos
typically exhibits perspective distortions. However, there are
numerous methods in the literature for rectifying perspective
distortions for camera-based document images [10]. The key
issue for most perspective correction methods is vanishing
point detection. The methods of vanishing point detection
can be categorized into three groups: direct, indirect, and
hybrid methods. The direct methods perform analysis and
calculation directly on the image pixels, such as projection
analysis from a perspective view to detect the horizontal
vanishing point [185]. The indirect methods convert the original space into a clue space. Most indirect methods involve
extracting multiple straight or illusory lines and using model
fitting to vote for vanishing points [48], [49], [186]–[188]. The
hybrid methods combine the results from clustering (on illusory lines) and projection analysis (on representative image
pixels) [183], [189].
In summary, one possible direction to address all skew,
curved, perspective and unaligned distortions in text is to
utilize and integrate a variety of region and edge detectors
from computer vision such as MSERs [60], SWT [50], Edge
Boxes [190], and Aggregate Channel Features [191] and to
use effective learning techniques such as AdaBoost, CRFs,
Random Forests, and CNNs to identify distorted text regions.
For skewed and unaligned text, we can simply use the
common character/word classifiers after text de-skewing and
realigning. The difficulty in recognizing curved or perspective
text is that it is challenging to prepare and annotate sufficient
distorted training samples and to construct robust character and
word classifiers. One research direction for perspective text
recognition is to train classifiers on data with a variety of perspective distortions automatically produced by a synthetic text
generation engine [85]. One possible solution for curved word
recognition is to use segmentation-based word recognition
methods, which construct strong character classifiers and
search the target word by using optimization techniques with
some priors (e.g., a lexicon) [192].
2) Multilingual Text Detection and Recognition: Multilingual text occurs frequently in scene images and web videos.
The most common language pairs involve English in conjunction with a native language. Currently, most text
detection and recognition techniques are specifically designed
for default English text, and few research efforts have been
conducted on reading multilingual text [193].
In general, text in different languages has a variety of
characteristics and appearances. Hence, a text detector trained
on one language may fail to locate text in another language
in some cases. For example, Yin et al. trained their scene text
detector on an English dataset and a multilingual (Chinese
and English) dataset, respectively, and tested it on the ICDAR 2011 Robust Reading Competition test set, obtaining quite
different performances [6]. Possible solutions for multilingual
text detection include designing language-free text feature
extractors, e.g., SWT [50] and Stroke Feature Transform [71],
or constructing language-free text detectors with text-sensitive
vision (edge or region) features, e.g., MSER [60] and
Edge Boxes [190].
Multilingual character/word recognition is also a challenging problem because of the large number of character/word
classes, the complexity of character structures, the similarity
of character appearances, and font variations. As a result, most
current OCR (text recognition) technologies first determine the
language class (script identification [194]) and subsequently
use a corresponding language-specific engine to recognize
text. For scene images and complex videos, some similar
but more complex strategies can be utilized. For example,
Shivakumara et al. integrated gradient spatial and structural features to classify video text blocks into different
languages [195].
B. Future Directions
In the past two decades, a variety of research achievements
have been published in the literature. However, because of
the wide variety of challenges, current technologies and systems for video text detection, tracking and recognition have limited performance. Hence, typical grand challenges and recent new applications (as described above) are accompanied by many
open issues, and numerous research opportunities for both
technologies and applications. In addition to the discussions
mentioned above (Section VII-A), research directions specifically for video text extraction (e.g., text tracking, and tracking
based detection and recognition) are highlighted here, and possible novel applications of the technology are also discussed.
1) Text Tracking in Complex Videos: In the current literature, major research efforts concerned with video text tracking
are mainly based on template matching with simple features
such as the difference of intensity values [106], cumulative histograms [97], [99], vertical and horizontal projection
profiles [98], and ESM-based features [108]. Tracking with
template matching can obtain good performance for layered
caption text; however, the backgrounds of consecutive frames

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 6, JUNE 2016

may vary widely, which makes it challenging to track scene
text or embedded caption text effectively. Afterwards, some
robust features such as SIFT [101], [102], SURF [104] and
MSER [112] are extracted and applied to video text tracking.
Nevertheless, extracting and matching these features is a
time-consuming process. Moreover, to address low contrast,
low resolution, complex backgrounds and text movements,
although CAMSHIFT [36], [137], particle filtering [114],
[115], [117], [118], optical flow [136] and UKF [138] have
been applied to track text frame by frame, the accuracy of
text tracking has not increased significantly.
The tracking-by-detection method provides a good framework for tracking. Detection outputs are first used as constraints to optimize the text search. The optimized tracking
results are then utilized to guide text detection in subsequent
frames. The combination of these two processes is expected to
obtain high tracking accuracy. Several text tracking methods
that use tracking-by-detection have been proposed, but they have
limited performance, primarily attributable to two aspects.
First, the text detection results are not robust. Although detection results can guide tracking, false positives reduce tracking
accuracy. Second, the features used by current tracking algorithms are not suitable for text. Therefore, to accurately track
text in complex videos using tracking-by-detection, future
work in this area needs to improve the text detection and
investigate more discriminative features to represent text in
tracking. Another future issue is to combine different tracking
strategies for text tracking [140], [141].
2) Unified Frameworks for Tracking Based Text Detection
and Recognition: In the literature, there are several text extraction methods that utilize spatial and temporal information
(i.e., detecting and recognizing text using multiple frames).
However, these two tracking based components (detection
and recognition) are handled separately (tracking based text
detection [108], [112], [118] or tracking based text recognition [125]) and do not uniformly integrate detection, tracking,
recognition and their interactions. As described in DETR
(see Fig. 2), text detection, tracking and recognition can share
information and their interactions should be taken into account.
Therefore, a more important issue of video text extraction
is to investigate unified frameworks for tracking based text
detection and recognition [141].
3) Robust Text Reading in the Wild: With the rapid advancement of text extraction technology, robustly reading text in the
wild is an inevitable trend because it improves the convenience
of people's daily lives. This field includes tasks such as
assistance for visually challenged persons, real-time translation
services, user navigation, and so on. The core technology
involves extracting text from scenes. To meet these demands,
it is vital to solve the problems of varied types of distortions,
multilingual text, and complex backgrounds. As summarized
in Section VI-B, a variety of systems for reading text in
the wild have been constructed; however, most of these
are prototype systems. There are still no effective methods
(for text detection, tracking and recognition) to solve these
issues.
Possible solutions should fully utilize text tracking
techniques and adopt tracking based detection and

recognition methods. Considering that text in the wild usually remains visible for some time, it is possible to
make use of temporal redundancy to improve the accuracy
of text extraction, not just extract text from an individual
frame. In other words, combining text detection and text
recognition with text tracking algorithms can greatly enhance
the performance of text extraction.
4) Text Recognition and Retrieval for Web Videos:
As described in Section 1 (Introduction), understanding and
retrieval of video on the web (big visual data) is becoming
an increasingly important issue given the explosive growth
of online social media. A variety of research efforts have
verified that fusing textual (including video text) and visual
representations can improve video retrieval systems. However,
web videos are generally characterized by a high degree of
diversity including creator, content, style, production, quality,
encoding, language, and so on, all of which raise the challenge
level when extracting video text. Moreover, text in web videos
is almost always composed of embedded caption text or scene
text, which are even more challenging because of their complex foregrounds and backgrounds. Additionally, the massive
and ever-increasing volume of web videos also requires realtime processing for text recognition and retrieval.
All these challenges lead to numerous open issues, research
opportunities, and wide applications in both academia and
industry. Currently, one expected direction will be for
researchers to provide more accurate video text recognition (OCR) results on the TRECVID Multimedia Event
Detection database.
VIII. S UMMARY
This paper presents an extensive survey within a unified
framework of text detection, tracking and recognition in video,
where text tracking, tracking based detection, and tracking
based recognition are specifically summarized and highlighted.
We also describe the available datasets, evaluation protocols,
technological applications, and grand challenges of video
text extraction. More importantly, major technological trends
and directions are discussed in detail, with the intention of
identifying the open issues and potential directions for future
research from the current research literature.
ACKNOWLEDGMENTS
The authors are grateful to the associate editor Dr. Dimitrios
Tzovaras and the anonymous reviewers for their constructive
comments.
R EFERENCES
[1] Q. Ye and D. Doermann, Text detection and recognition in imagery:
A survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7,
pp. 14801500, Jul. 2015.
[2] Y. Zhu, C. Yao, and X. Bai, Scene text detection and recognition:
Recent advances and future trends, Frontiers Comput. Sci., vol. 10,
no. 1, pp. 1936, Feb. 2016.
[3] T. Mei, Y. Rui, S. Li, and Q. Tian, Multimedia search reranking: A literature survey, ACM Comput. Surv., vol. 46, no. 3,
pp. 38:138:38, Jan. 2014.
[4] P. Over et al., TRECVID 2014 – An overview of the goals, tasks, data, evaluation mechanisms and metrics, in Proc. TRECVID, 2014, p. 52.
[5] Y. Zhong, H. Zhang, and A. K. Jain, Automatic caption localization
in compressed video, IEEE Trans. Pattern Anal. Mach. Intell., vol. 22,
no. 4, pp. 385392, Apr. 2000.
[6] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, Robust text detection in
natural scene images, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
no. 5, pp. 970983, May 2014.
[7] J. Dalton, J. Allan, and P. Mirajkar, Zero-shot video retrieval using
content and concepts, in Proc. ACM Int. Conf. Inf. Knowl. Manage. (CIKM), 2013, pp. 18571860.
[8] G. Nagy, Twenty years of document image analysis in PAMI, IEEE
Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 6384, Jan. 2000.
[9] K. Jung, K. I. Kim, and A. K. Jain, Text information extraction
in images and video: A survey, Pattern Recognit., vol. 37, no. 5,
pp. 977997, May 2004.
[10] J. Liang, D. Doermann, and H. Li, Camera-based analysis of text
and documents: A survey, Int. J. Document Anal. Recognit., vol. 7,
nos. 23, pp. 84104, 2005.
[11] J. Zhang and R. Kasturi, Extraction of text objects in video documents: Recent progress, in Proc. IAPR Workshop Document Anal.
Syst. (DAS), Sep. 2008, pp. 517.
[12] H. Zhang, K. Zhao, Y.-Z. Song, and J. Guo, Text extraction
from natural scene image: A survey, Neurocomputing, vol. 122,
pp. 310323, Dec. 2013.
[13] X. Qian, G. Liu, H. Wang, and R. Su, Text detection, localization,
and tracking in compressed video, Signal Process., Image Commun.,
vol. 22, no. 9, pp. 752768, Oct. 2007.
[14] S. Antani, D. Crandall, and R. Kasturi, Robust extraction of text
in video, in Proc. Int. Conf. Pattern Recognit. (ICPR), Sep. 2000,
pp. 831834.
[15] K. Elagouni, C. Garcia, and P. Sébillot, A comprehensive neural-based approach for text recognition in videos using natural language
processing, in Proc. ACM Int. Conf. Multimedia Retr. (ICMR), 2011,
p. 23.
[16] Q. Ye, J. Jiao, J. Huang, and H. Yu, Text detection and restoration
in natural scene images, J. Vis. Commun. Image Represent., vol. 18,
no. 6, pp. 504513, Dec. 2007.
[17] A. K. Jain and B. Yu, Automatic text location in images and video
frames, Pattern Recognit., vol. 31, no. 12, pp. 20552076, Dec. 1998.
[18] V. Y. Mariano and R. Kasturi, Locating uniform-colored text in video
frames, in Proc. 15th Int. Conf. Pattern Recognit., vol. 4. Sep. 2000,
pp. 539542.
[19] M. León and A. Gasull, Text detection in images and video
sequences, in Proc. IADAT Int. Conf. Multi-Media, Image Process.
Comput. Vis., Madrid, Spain, Mar. 2005, pp. 15.
[20] M. Petter, V. Fragoso, M. Turk, and C. Baur, Automatic text detection
for mobile augmented reality translation, in Proc. IEEE Int. Conf.
Comput. Vis. Workshops (ICCV), Nov. 2011, pp. 4855.
[21] M. R. Lyu, J. Song, and M. Cai, A comprehensive method for
multilingual video text detection, localization, and extraction, IEEE
Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 243255,
Feb. 2005.
[22] M. Anthimopoulos, B. Gatos, and I. Pratikakis, A hybrid system
for text detection in video frames, in Proc. 8th IAPR Int. Workshop
Document Anal. Syst. (DAS), Sep. 2008, pp. 286292.
[23] P. Shivakumara, W. Huang, and C. L. Tan, An efficient edge based
technique for text detection in video frames, in Proc. 8th IAPR Int.
Workshop Document Anal. Syst. (DAS), Sep. 2008, pp. 307314.
[24] P. Shivakumara, T. Q. Phan, and C. L. Tan, Video text detection based
on filters and edge features, in Proc. IEEE Int. Conf. Multimedia
Expo (ICME), Jun./Jul. 2009, pp. 514517.
[25] P. Shivakumara, T. Q. Phan, and C. L. Tan, A gradient difference based
technique for video text detection, in Proc. 10th Int. Conf. Document
Anal. Recognit. (ICDAR), Jul. 2009, pp. 156160.
[26] P. Shivakumara, W. Huang, and C. L. Tan, Efficient video text
detection using edge features, in Proc. 19th Int. Conf. Pattern
Recognit. (ICPR), Dec. 2008, pp. 14.
[27] T. Q. Phan, P. Shivakumara, and C. L. Tan, A Laplacian method
for video text detection, in Proc. 10th Int. Conf. Document Anal.
Recognit. (ICDAR), Jul. 2009, pp. 6670.
[28] D. S. Guru, S. Manjunath, P. Shivakumara, and C. L. Tan, An eigen
value based approach for text detection in video, in Proc. 9th IAPR
Int. Workshop Document Anal. Syst., 2010, pp. 501506.
[29] M. Anthimopoulos, B. Gatos, and I. Pratikakis, A two-stage scheme
for text detection in video images, Image Vis. Comput., vol. 28, no. 9,
pp. 1413–1426, Sep. 2010.
[30] A. Jamil, I. Siddiqi, F. Arif, and A. Raza, Edge-based features for
localization of artificial Urdu text in video images, in Proc. Int. Conf.
Document Anal. Recognit. (ICDAR), Sep. 2011, pp. 11201124.
[31] N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein, and C. L. Tan,
A new method for arbitrarily-oriented text detection in video, in Proc.
10th IAPR Int. Workshop Document Anal. Syst. (DAS), Mar. 2012,
pp. 7478.
[32] P. Shivakumara, R. P. Sreedhar, T. Q. Phan, S. Lu, and C. L. Tan,
Multioriented video scene text detection through Bayesian classification and boundary growing, IEEE Trans. Circuits Syst. Video Technol.,
vol. 22, no. 8, pp. 12271235, Aug. 2012.
[33] P. Shivakumara, T. Q. Phan, S. Lu, and C. L. Tan, Gradient vector flow
and grouping-based method for arbitrarily oriented scene text detection
in video images, IEEE Trans. Circuits Syst. Video Technol., vol. 23,
no. 10, pp. 17291739, Oct. 2013.
[34] H. Li, D. Doermann, and O. Kia, Automatic text detection and
tracking in digital video, IEEE Trans. Image Process., vol. 9, no. 1,
pp. 147156, Jan. 2000.
[35] W. Mao, F.-L. Chung, K. K. M. Lam, and W.-C. Sun, Hybrid
Chinese/English text detection in images and video frames, in Proc.
16th Int. Conf. Pattern Recognit., vol. 3. 2002, pp. 10151018.
[36] K. I. Kim, K. Jung, and J. H. Kim, Texture-based approach for text
detection in images using support vector machines and continuously
adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach.
Intell., vol. 25, no. 12, pp. 16311639, Dec. 2003.
[37] C. W. Lee, K. Jung, and H. J. Kim, Automatic text detection and
removal in video sequences, Pattern Recognit. Lett., vol. 24, no. 15,
pp. 26072623, Nov. 2003.
[38] Q. Ye, Q. Huang, W. Gao, and D. Zhao, Fast and robust text detection
in images and video frames, Image Vis. Comput., vol. 23, no. 6,
pp. 565576, Jun. 2005.
[39] P. Shivakumara, T. Q. Phan, and C. L. Tan, A robust wavelet transform
based technique for video text detection, in Proc. 10th Int. Conf.
Document Anal. Recognit. (ICDAR), Jul. 2009, pp. 12851289.
[40] P. Shivakumara, T. Q. Phan, and C. L. Tan, New Fourier-statistical
features in RGB space for video text detection, IEEE Trans. Circuits
Syst. Video Technol., vol. 20, no. 11, pp. 15201532, Nov. 2010.
[41] P. Shivakumara, W. Huang, T. Q. Phan, and C. L. Tan, Accurate video
text detection through classification of low and high contrast images,
Pattern Recognit., vol. 43, no. 6, pp. 21652185, Jun. 2010.
[42] Z. Li, G. Liu, X. Qian, D. Guo, and H. Jiang, Effective and efficient
video text extraction using key text points, IET Image Process., vol. 5,
no. 8, pp. 671683, Dec. 2011.
[43] P. Shivakumara, T. Q. Phan, and C. L. Tan, A Laplacian approach
to multi-oriented text detection in video, IEEE Trans. Pattern Anal.
Mach. Intell., vol. 33, no. 2, pp. 412419, Feb. 2011.
[44] P. Shivakumara, A. Dutta, C. L. Tan, and U. Pal, Multi-oriented
scene text detection in video based on wavelet and angle projection boundary growing, Multimedia Tools Appl., vol. 72, no. 1,
pp. 515539, Sep. 2014.
[45] Z. Ji, J. Wang, and Y.-T. Su, Text detection in video frames using
hybrid features, in Proc. Int. Conf. Mach. Learn. Cybern., vol. 1.
Jul. 2009, pp. 318322.
[46] M. Cai, J. Song, and M. R. Lyu, A new approach for video
text detection, in Proc. Int. Conf. Image Process., vol. 1. 2002,
pp. I-117I-120.
[47] E. K. Wong and M. Chen, A new robust algorithm for video text
extraction, Pattern Recognit., vol. 36, no. 6, pp. 13971406, Jun. 2003.
[48] P. Clark and M. Mirmehdi, Recognising text in real scenes, Int. J.
Document Anal. Recognit., vol. 4, no. 4, pp. 243257, Jul. 2002.
[49] G. K. Myers, R. C. Bolles, Q.-T. Luong, J. A. Herson, and H. B. Aradhye, Rectification and recognition of text in 3-D scenes, Int. J.
Document Anal. Recognit., vol. 7, no. 2, pp. 147158, Jul. 2004.
[50] B. Epshtein, E. Ofek, and Y. Wexler, Detecting text in natural scenes
with stroke width transform, in Proc. Int. Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2010, pp. 29632970.
[51] C. Yi and Y. Tian, Localizing text in scene images by boundary
clustering, stroke segmentation, and string fragment classification,
IEEE Trans. Image Process., vol. 21, no. 9, pp. 42564268, Sep. 2012.
[52] T. Q. Phan, P. Shivakumara, and C. L. Tan, Detecting text in the real
world, in Proc. ACM Int. Conf. Multimedia (MM), 2012, pp. 765768.
[53] X. Yin, X.-C. Yin, H.-W. Hao, and K. Iqbal, Effective text localization
in natural scene images with MSER, geometry-based grouping and
AdaBoost, in Proc. Int. Conf. Pattern Recognit. (ICPR), Nov. 2012,
pp. 725–728.
[54] X. Chen and A. L. Yuille, Detecting and reading text in natural
scenes, in Proc. Int. Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun./Jul. 2004, pp. II-366II-373.
[55] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch, AdaBoost for
text detection in natural scene, in Proc. Int. Conf. Document Anal.
Recognit. (ICDAR), Sep. 2011, pp. 429434.
[56] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, End-to-end text
recognition with convolutional neural networks, in Proc. Int. Conf.
Pattern Recognit. (ICPR), Nov. 2012, pp. 33043308.
[57] Z. Zhang, W. Shen, C. Yao, and X. Bai, Symmetry-based text line
detection in natural scenes, in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2015, pp. 25582567.
[58] Y.-F. Pan, X. Hou, and C.-L. Liu, A hybrid approach to detect and
localize texts in natural scene images, IEEE Trans. Image Process.,
vol. 20, no. 3, pp. 800813, Mar. 2011.
[59] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, Conditional random
fields: Probabilistic models for segmenting and labeling sequence data,
in Proc. Int. Conf. Mach. Learn., 2001, pp. 282289.
[60] J. Matas, O. Chum, M. Urban, and T. Pajdla, Robust wide baseline
stereo from maximally stable extremal regions, in Proc. Brit. Mach.
Vis. Conf., vol. 1. 2002, pp. 384393.
[61] L. Neumann and J. Matas, Real-time scene text localization and recognition, in Proc. Int. Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2012, pp. 35383545.
[62] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, Scene text detection
using graph model built upon maximally stable extremal regions,
Pattern Recognit. Lett., vol. 34, no. 2, pp. 107116, Jan. 2013.
[63] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, Accurate and robust
text detection: A step-in for text retrieval in natural scene images, in
Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2013,
pp. 10911092.
[64] H. I. Koo and D. H. Kim, Scene text detection via connected component clustering and nontext filtering, IEEE Trans. Image Process.,
vol. 22, no. 6, pp. 22962305, Jun. 2013.
[65] L. Sun, Q. Huo, W. Jia, and K. Chen, Robust text detection
in natural scene images by generalized color-enhanced contrasting
extremal region and neural networks, in Proc. Int. Conf. Pattern
Recognit. (ICPR), 2014, pp. 27152720.
[66] L. Kang, Y. Li, and D. Doermann, Orientation robust text line
detection in natural images, in Proc. Int. Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2014, pp. 40344041.
[67] W. Huang, Y. Qiao, and X. Tang, Robust scene text detection with
convolution neural network induced MSER trees, in Proc. 13th Eur.
Conf. Comput. Vis. (ECCV), 2014, pp. 497511.
[68] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, Multi-orientation
scene text detection with adaptive clustering, IEEE Trans. Pattern
Anal. Mach. Intell., vol. 37, no. 9, pp. 19301937, Sep. 2015.
[69] L. Sun, Q. Huo, W. Jia, and K. Chen, A robust approach for text
detection from natural scene images, Pattern Recognit., vol. 48, no. 9,
pp. 29062920, Sep. 2015.
[70] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, Detecting texts of arbitrary
orientations in natural images, in Proc. Int. Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2012, pp. 10831090.
[71] W. Huang, Z. Lin, J. Yang, and J. Wang, Text localization in
natural images using stroke feature transform and text covariance
descriptors, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013,
pp. 12411248.
[72] C. Yao, X. Bai, and W. Liu, A unified framework for multioriented
text detection and recognition, IEEE Trans. Image Process., vol. 23,
no. 11, pp. 47374749, Nov. 2014.
[73] M. Jaderberg, A. Vedaldi, and A. Zisserman, Deep features for
text spotting, in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), 2014,
pp. 512528.
[74] K. Elagouni, C. Garcia, F. Mamalet, and P. Sébillot, Text recognition
in videos using a recurrent connectionist approach, in Artificial Neural
Networks and Machine Learning – ICANN. Berlin, Germany: Springer,
2012, pp. 172179.
[75] D. Chen, J.-M. Odobez, and H. Bourlard, Text detection and recognition in images and video frames, Pattern Recognit., vol. 37, no. 3,
pp. 595608, Mar. 2004.
[76] Z. Saidane and C. Garcia, Robust binarization for video text recognition, in Proc. 9th Int. Conf. Document Anal. Recognit. (ICDAR),
vol. 2. Sep. 2007, pp. 874879.
[77] D. Karatzas et al., ICDAR 2015 competition on robust reading, in
Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015,
pp. 1156–1160.
[78] R. Manmatha, C. Han, and E. M. Riseman, Word spotting: A new
approach to indexing handwriting, in Proc. Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 1996, pp. 631637.
[79] J. J. Weinman, Z. Butler, D. Knoll, and J. Feild, Toward integrated
scene text reading, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
no. 2, pp. 375387, Feb. 2013.
[80] A. Mishra, K. Alahari, and C. V. Jawahar, Top-down and bottom-up
cues for scene text recognition, in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2012, pp. 26872694.
[81] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, and Z. Zhang, Scene
text recognition using part-based tree-structured character detection, in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013,
pp. 29612968.
[82] K. Wang and S. Belongie, Word spotting in the wild, in Proc. Eur.
Conf. Comput. Vis. (ECCV), 2010, pp. 591604.
[83] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, PhotoOCR:
Reading text in uncontrolled conditions, in Proc. Int. Conf. Comput.
Vis. (ICCV), Dec. 2013, pp. 785792.
[84] O. Alsharif and J. Pineau, End-to-end text recognition with hybrid
HMM maxout models, in Proc. Int. Conf. Learn. Represent. (ICLR),
2014, pp. 110.
[85] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, Synthetic
data and artificial neural networks for natural scene text recognition,
CoRR, vol. abs/1406.2227, 2014.
[86] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, Reading
text in the wild with convolutional neural networks, Int. J. Comput.
Vis., vol. 116, no. 1, pp. 120, Jan. 2016.
[87] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, Word spotting
and recognition with embedded attributes, IEEE Trans. Pattern Anal.
Mach. Intell., vol. 36, no. 12, pp. 25522566, Dec. 2014.
[88] A. Gordo, Supervised mid-level features for word image representation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2015, pp. 29562964.
[89] P. X. Nguyen, K. Wang, and S. Belongie, Video text detection and
recognition: Dataset and benchmark, in Proc. IEEE Winter Conf. Appl.
Comput. Vis., Mar. 2014, pp. 776783.
[90] A. Yilmaz, O. Javed, and M. Shah, Object tracking: A survey, ACM
Comput. Surv., vol. 38, no. 4, 2006, Art. no. 13.
[91] A. Milan, S. Roth, and K. Schindler, Continuous energy minimization
for multitarget tracking, IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 36, no. 1, pp. 5872, Jan. 2014.
[92] N. Wang, J. Shi, D.-Y. Yeung, and J. Jia. (Apr. 23, 2015). Understanding and diagnosing visual tracking systems. [Online]. Available:
http://arxiv.org/abs/1504.06055
[93] K. Mikolajczyk and C. Schmid, A performance evaluation of local
descriptors, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10,
pp. 16151630, Oct. 2005.
[94] R. Lienhart and F. Stuber, Automatic text recognition in digital
videos, in Proc. Electron. Imag., Sci. Technol., 1996, pp. 180188.
[95] R. Lienhart and W. Effelsberg, Automatic text segmentation and
text recognition for video indexing, Multimedia Syst., vol. 8, no. 1,
pp. 6981, Jan. 2000.
[96] H. Shiratori, H. Goto, and H. Kobayashi, An efficient text capture
method for moving robots using DCT feature and text tracking,
in Proc. 18th Int. Conf. Pattern Recognit. (ICPR), vol. 2. 2006,
pp. 10501053.
[97] M. Tanaka and H. Goto, Autonomous text capturing robot using
improved DCT feature and text tracking, in Proc. 9th Int. Conf.
Document Anal. Recognit. (ICDAR), vol. 2. Sep. 2007, pp. 11781182.
[98] R. Lienhart and A. Wernicke, Localizing and segmenting text in
images and videos, IEEE Trans. Circuits Syst. Video Technol., vol. 12,
no. 4, pp. 256268, Apr. 2002.
[99] W. Huang, P. Shivakumara, and C. L. Tan, Detecting moving text
in video using temporal information, in Proc. 19th Int. Conf. Pattern
Recognit. (ICPR), Dec. 2008, pp. 14.
[100] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis., vol. 60, no. 2, pp. 91110, 2004.
[101] Y. Na and D. Wen, An effective video text tracking algorithm based on
SIFT feature and geometric constraint, in Proc. 11th Pacific Rim Conf.
Multimedia Adv. Multimedia Inf. Process. (PCM), 2010, pp. 392403.
[102] T. Q. Phan, P. Shivakumara, T. Lu, and C. L. Tan, Recognition of video text through temporal integration, in Proc. 12th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2013, pp. 589–593.
[103] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, Speeded-up robust
features (SURF), Comput. Vis. Image Understand., vol. 110, no. 3,
pp. 346359, 2008.

[104] T. Yusufu, Y. Wang, and X. Fang, A video text detection and tracking
system, in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2013,
pp. 522529.
[105] H. Li and D. Doermann, Automatic text tracking in digital videos,
in Proc. IEEE 2nd Workshop Multimedia Signal Process., Dec. 1998,
pp. 2126.
[106] H. Li and D. Doermann, Text enhancement in digital video using
multiple frame integration, in Proc. 7th ACM Int. Conf. Multimedia
(ACM MM), 1999, pp. 1922.
[107] J. Xi, X.-S. Hua, X.-R. Chen, L. Wenyin, and H.-J. Zhang, A video
text detection and recognition system, in Proc. ICME, Aug. 2001,
pp. 873876.
[108] V. Fragoso, S. Gauglitz, S. Zamora, J. Kleban, and M. Turk, TranslatAR: A mobile augmented reality translator, in Proc. IEEE Workshop
Appl. Comput. Vis. (WACV), Jan. 2011, pp. 497502.
[109] S. Benhimane and E. Malis, Real-time image-based tracking of planes
using efficient second-order minimization, in Proc. IEEE/RSJ Int.
Conf. Intell. Robots Syst. (IROS), vol. 1. Sep./Oct. 2004, pp. 943948.
[110] M. A. Fischler and R. Bolles, Random sample consensus: A paradigm
for model fitting with applications to image analysis and automated
cartography, Commun. ACM, vol. 24, no. 6, pp. 381395, 1981.
[111] M. Muja and D. G. Lowe, Fast approximate nearest neighbors with
automatic algorithm configuration, in Proc. VISAPP, vol. 1. 2009,
pp. 331340.
[112] L. Gomez and D. Karatzas, MSER-based real-time text detection
and tracking, in Proc. 22nd Int. Conf. Pattern Recognit. (ICPR),
Aug. 2014, pp. 31103115.
[113] M. Donoser and H. Bischof, Efficient maximally stable extremal
region (MSER) tracking, in Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., vol. 1. Jun. 2006, pp. 553560.
[114] M. Tanaka and H. Goto, Text-tracking wearable camera system
for visually-impaired people, in Proc. 19th Int. Conf. Pattern
Recognit. (ICPR), Dec. 2008, pp. 14.
[115] H. Goto and M. Tanaka, Text-tracking wearable camera system for
the blind, in Proc. 10th Int. Conf. Document Anal. Recognit. (ICDAR),
Jul. 2009, pp. 141145.
[116] C. Merino and M. Mirmehdi, A framework towards realtime detection
and tracking of text, in Proc. 2nd Int. Workshop Camera-Based
Document Anal. Recognit. (CBDAR), 2007, pp. 1017.
[117] C. Merino-Gracia, K. Lenc, and M. Mirmehdi, A head-mounted device
for recognizing text in natural scenes, in Proc. 4th Int. Workshop
Camera-Based Document Anal. Recognit., 2012, pp. 2941.
[118] R. Minetto, N. Thome, M. Cord, N. J. Leite, and J. Stolfi, Snoopertrack: Text detection and tracking for outdoor videos, in Proc. 18th
IEEE Int. Conf. Image Process. (ICIP), Sep. 2011, pp. 505508.
[119] C. Wolf, J. M. Jolion, and F. Chassaing, Text localization, enhancement and binarization in multimedia documents, in Proc. 16th Int.
Conf. Pattern Recognit., vol. 2. 2002, pp. 10371040.
[120] C. Mi, Y. Xu, H. Lu, and X. Xue, A novel video text extraction
approach based on multiple frames, in Proc. 5th Int. Conf. Inf.,
Commun. Signal, 2005, pp. 678682.
[121] J. Zhou, L. Xu, B. Xiao, R. Dai, and S. Si, A robust system for text
extraction in video, in Proc. Int. Conf. Mach. Vis. (ICMV), Dec. 2007,
pp. 119124.
[122] W. Zhen and W. Zhiqiang, An efficient video text recognition system,
in Proc. 2nd Int. Conf. Intell. Human-Mach. Syst. Cybern. (IHMSC),
vol. 1. Aug. 2010, pp. 174177.
[123] X. Liu and W. Wang, Robustly extracting captions in videos based on
stroke-like edges and spatio-temporal analysis, IEEE Trans. Multimedia, vol. 14, no. 2, pp. 482489, Apr. 2012.
[124] B. Wang, C. Liu, and X. Ding, A research on video text tracking and
recognition, Proc. SPIE, vol. 8664, p. 86640G, Mar. 2013.
[125] X. Rong, C. Yi, X. Yang, and Y. Tian, Scene text recognition in
multiple frames based on text tracking, in Proc. IEEE Int. Conf.
Multimedia Expo (ICME), Jul. 2014, pp. 16.
[126] Y. Nakajima, A. Yoneyama, H. Yanagihara, and M. Sugano, Moving-object detection from MPEG coded data, Proc. SPIE, vol. 3309, pp. 988–996, Jan. 1998.
[127] M. Pilu, Using raw MPEG motion vectors to determine global camera
motion, Proc. SPIE, vol. 3309, pp. 448459, Jan. 1998.
[128] S. Antani, D. Grandall, and R. Kasturi, Robust extraction of text in
video, in Proc. Int. Conf. Pattern Recognit., Sep. 2000, pp. 18311834.
[129] U. Gargi, D. Crandall, S. Antani, T. Gandhi, R. Keener, and R. Kasturi,
A system for automatic text detection in video, in Proc. 5th Int. Conf. Document Anal. Recognit., Sep. 1999, pp. 29–32.

[130] D. Crandall, S. Antani, and R. Kasturi, Extraction of special effects caption text events from digital video, Int. J. Document Anal. Recognit., vol. 5, nos. 2–3, pp. 138–157, Apr. 2003.
[131] J. Gllavata, R. Ewerth, and B. Freisleben, Tracking text in MPEG
videos, in Proc. 12th Annu. ACM Int. Conf. Multimedia (ACM MM),
2004, pp. 240243.
[132] B.-L. Yeo and B. Liu, Rapid scene analysis on compressed video,
IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 6, pp. 533544,
Dec. 1995.
[133] H. Jiang et al., A fast and effective text tracking in compressed
video, in Proc. 10th IEEE Int. Symp. Multimedia (ISM), Dec. 2008,
pp. 136141.
[134] G. R. Bradski, Computer vision face tracking for use in a perceptual user interface, in Proc. 4th IEEE Workshop Appl. Comput.
Vis. (WACV), 1998, pp. 115.
[135] G. K. Myers and B. Burns, A robust method for tracking scene text
in video imagery, in Proc. CBDAR, 2005, pp. 16.
[136] X. Zhao, K. H. Lin, Y. Fu, Y. Hu, Y. Liu, and T. S. Huang, Text from
corners: A novel approach to detect text and caption in videos, IEEE
Trans. Image Process., vol. 20, no. 3, pp. 790799, Mar. 2011.
[137] A. Mosleh, N. Bouguila, and A. Ben Hamza, Automatic inpainting
scheme for video text detection and removal, IEEE Trans. Image
Process., vol. 22, no. 10, pp. 44604472, Nov. 2013.
[138] C. Merino-Gracia and M. Mirmehdi, Real-time text tracking in natural
scenes, IET Comput. Vis., vol. 8, no. 6, pp. 670681, Dec. 2014.
[139] E. A. Wan and R. Van Der Merwe, The unscented Kalman filter
for nonlinear estimation, in Proc. IEEE Adapt. Syst. Signal Process.,
Commun., Control Symp. (AS-SPCC), Oct. 2000, pp. 153158.
[140] Z.-Y. Zuo, S. Tian, and X.-C. Yin, Multi-strategy tracking based text
detection in scene videos, in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015, pp. 66–70.
[141] S. Tian, W.-Y. Pei, Z.-Y. Zuo, and X.-C. Yin, Scene text detection in
video by learning locally and globally, in Proc. 25th Int. Joint Conf.
Artif. Intell. (IJCAI), 2016.
[142] H. Li and D. Doermann, Superresolution-based enhancement of text
in digital video, in Proc. 15th Int. Conf. Pattern Recognit., vol. 1.
Sep. 2000, pp. 847850.
[143] X.-S. Hua, P. Yin, and H.-J. Zhang, Efficient video text recognition
using multiple frame integration, in Proc. Int. Conf. Image Process.,
vol. 2. 2002, pp. II-397II-400.
[144] X.-S. Hua, X.-R. Chen, L. Wenyin, and H.-J. Zhang, Automatic
location of text in video frames, in Proc. ACM Workshops Multimedia,
Multimedia Inf. Retr., 2001, pp. 2427.
[145] J.-C. Shim, C. Dorai, and R. Bolle, Automatic text extraction from
video for content-based annotation and retrieval, in Proc. 14th Int.
Conf. Pattern Recognit., vol. 1. Aug. 1998, pp. 618620.
[146] T. Sato, T. Kanade, E. K. Hughes, M. A. Smith, and S. Satoh, Video
OCR: Indexing digital news libraries by recognition of superimposed
captions, Multimedia Syst., vol. 7, no. 5, pp. 385395, Sep. 1999.
[147] J. Yi, Y. Peng, and J. Xiao, Using multiple frame integration for the
text recognition of video, in Proc. 10th Int. Conf. Document Anal.
Recognit. (ICDAR), Jul. 2009, pp. 7175.
[148] T. Mita and O. Hori, Improvement of video text recognition by
character selection, in Proc. 6th Int. Conf. Document Anal. Recognit.,
Sep. 2001, pp. 10891093.
[149] J. Greenhalgh and M. Mirmehdi, Recognizing text-based traffic signs,
IEEE Trans. Intell. Transp. Syst., vol. 16, no. 3, pp. 13601369,
Jun. 2015.
[150] X.-S. Hua, L. Wenyin, and H.-J. Zhang, An automatic performance
evaluation protocol for video text detection algorithms, IEEE Trans.
Circuits Syst. Video Technol., vol. 14, no. 4, pp. 498507, Apr. 2004.
[151] V. Manohar et al., Performance evaluation of text detection and
tracking in video, in Proc. IAPR Int. Workshop Document Anal. Syst.,
2006, pp. 576587.
[152] R. Kasturi et al., Framework for performance evaluation of face,
text, and vehicle detection and tracking in video: Data, metrics, and
protocol, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2,
pp. 319336, Feb. 2009.
[153] D. Karatzas et al., ICDAR 2013 robust reading competition, in
Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2013,
pp. 14841493.
[154] X. Liu, W. Wang, and T. Zhu, Extracting captions in complex background from videos, in Proc. 20th Int. Conf. Pattern
Recognit. (ICPR), Aug. 2010, pp. 32323235.
[155] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young,
ICDAR 2003 robust reading competitions, in Proc. 7th Int. Conf. Document Anal. Recognit., vol. 2. Aug. 2003, pp. 682–687.

[156] A. F. Smeaton and P. Over, The TREC-2002 video track report, in Proc. 11th Text Retr. Conf. (TREC), Nov. 2002, pp. 1–17.
[157] H. Kuwano, Y. Taniguchi, H. Arai, M. Mori, S. Kurakake, and
H. Kojima, Telop-on-demand: Video structuring and retrieval based on
text recognition, in Proc. IEEE Int. Conf. Multimedia Expo (ICME),
vol. 2. Jul. 2000, pp. 759762.
[158] D. Zhang and S.-F. Chang, Event detection in baseball video using
superimposed caption recognition, in Proc. 10th ACM Int. Conf.
Multimedia (ACM MM), 2002, pp. 315318.
[159] L. Agnihotri and N. Dimitrova, Text detection for video analysis, in Proc. IEEE Workshop Content-Based Access Image Video
Libraries (CBAIVL), Jun. 1999, pp. 109113.
[160] R. Lienhart, C. Kuhmunch, and W. Effelsberg, On the detection
and recognition of television commercials, in Proc. IEEE Int. Conf.
Multimedia Comput. Syst., Jun. 1997, pp. 509516.
[161] U. Gargi, S. Antani, and R. Kasturi, Indexing text events in digital
video databases, in Proc. 14th Int. Conf. Pattern Recognit., vol. 1.
Aug. 1998, pp. 916918.
[162] K.-Y. Jeong, K. Jung, E. Y. Kim, and H. J. Kim, Neural networkbased text location for news video indexing, in Proc. Int. Conf. Image
Process. (ICIP), vol. 3. Oct. 1999, pp. 319323.
[163] Y.-K. Lim, S.-H. Choi, and S.-W. Lee, Text extraction in MPEG
compressed video for content-based indexing, in Proc. 15th Int. Conf.
Pattern Recognit., vol. 4. Sep. 2000, pp. 409412.
[164] V. C. Dinh, S. S. Chun, S. Cha, H. Ryu, and S. Sull, An efficient
method for text detection in video based on stroke width
similarity, in Proc. 8th Asian Conf. Comput. Vis. (ACCV), 2007, pp. 200–209.
[165] K. Iwatsuka, K. Yamamoto, and K. Kato, Development of a guide dog
system for the blind with character recognition ability, in Proc. 17th
Int. Conf. Pattern Recognit. (ICPR), vol. 1. May 2004, pp. 401405.
[166] X. Chen and A. L. Yuille, AdaBoost learning for detecting and reading
text in city scenes, Dept. Statist., Univ. California, Los Angeles, Los
Angeles, CA, USA, Tech. Rep., 2004.
[167] J.-P. Peters, C. Thillou, and S. Ferreira, Embedded reading device
for blind people: A user-centered design, in Proc. Int. Symp. Inf.
Theory (ISIT), Oct. 2004, pp. 217222.
[168] N. Ezaki, M. Bulacu, and L. Schomaker, Text detection from natural
scene images: Towards a system for visually impaired persons, in
Proc. 17th Int. Conf. Pattern Recognit. (ICPR), vol. 2. Aug. 2004,
pp. 683686.
[169] J. Chmiel, O. Stankiewicz, W. Switala, M. Tluczek, and J. Jelonek,
Read it project report: A portable text reading system for the blind
people, Dept. Comput. Sci. Manage., Poznan Univ. Technol., Poznan,
Poland, Tech. Rep., 2005.
[170] P. Sanketi, H. Shen, and J. M. Coughlan, Localizing blurry and
low-resolution text in natural images, in Proc. IEEE Workshop Appl.
Comput. Vis. (WACV), Jan. 2011, pp. 503510.
[171] I. Haritaoglu, Scene text extraction and translation for handheld
devices, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit. (CVPR), vol. 2. Dec. 2001, pp. II-408II-413.
[172] X. Shi and Y. Xu, A wearable translation robot, in Proc. IEEE Int.
Conf. Robot. Autom. (ICRA), Apr. 2005, pp. 44004405.
[173] H. Aoki, B. Schiele, and A. Pentland, Realtime personal positioning
system for a wearable computer, in Proc. 3rd Int. Symp. Wearable
Comput., Oct. 1999, pp. 3743.
[174] Y. T. Cui and Q. Huang, Character extraction of license plates
from video, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., Jun. 1997, pp. 502507.
[175] S. H. Park, K. I. Kim, K. Jung, and H. J. Kim, Locating car
license plates using neural networks, Electron. Lett., vol. 35, no. 17,
pp. 14751477, Aug. 1999.
[176] W. Wu, X. Chen, and J. Yang, Incremental detection of text on
road signs from video with application to a driving assistant system,
in Proc. 12th Annu. ACM Int. Conf. Multimedia (ACM MM), 2004,
pp. 852859.
[177] W. Wu, X. Chen, and J. Yang, Detection of text on road signs from
video, IEEE Trans. Intell. Transp. Syst., vol. 6, no. 4, pp. 378390,
Dec. 2005.
[178] J. Greenhalgh and M. Mirmehdi, Detection and recognition of painted
road markings, in Proc. 4th Int. Conf. Pattern Recognit. Appl.
Methods, 2015, pp. 130138.
[179] D. Létourneau, F. Michaud, J.-M. Valin, and C. Proulx, Textual
message read by a mobile robot, in Proc. IEEE/RSJ Int. Conf. Intell.
Robots Syst. (IROS), vol. 3. Oct. 2003, pp. 27242729.

[180] D. Létourneau, F. Michaud, and J.-M. Valin, Autonomous mobile robot that can read, EURASIP J. Appl. Signal Process., vol. 2004, no. 17, pp. 2650–2662, Dec. 2004.
[181] X.-C. Yin et al., DeTEXT: A database for evaluating text extraction from biomedical literature figures, PLoS ONE, vol. 10, no. 5,
p. e0126200, 2015.
[182] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara,
A. Dehghan, and M. Shah, Visual tracking: An experimental survey,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 14421468,
Jul. 2014.
[183] X.-C. Yin, H.-W. Hao, J. Sun, and S. Naoi, Robust vanishing point
detection for MobileCam-based documents, in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Sep. 2011, pp. 136140.
[184] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, A robust
arbitrary text detection system for natural scene images, Expert Syst.
Appl., vol. 41, no. 18, pp. 80278048, Dec. 2014.
[185] P. Clark and M. Mirmehdi, Rectifying perspective views of text in
3D scenes using vanishing points, Pattern Recognit., vol. 36, no. 11,
pp. 26732686, Nov. 2003.
[186] S. Lu, B. Chen, and C. C. Ko, Perspective rectification of document
images using fuzzy set and morphological operations, Image Vis.
Comput., vol. 23, no. 5, pp. 541553, May 2005.
[187] M. Pilu, Extraction of illusory linear clues in perspectively skewed
documents, in Proc. Int. Conf. Comput. Vis. Pattern Recognit. (CVPR),
Dec. 2001, pp. 363368.
[188] S. Pollard and M. Pilu, Building cameras for capturing documents,
Int. J. Document Anal. Recognit., vol. 7, no. 2, pp. 123137, Jul. 2005.
[189] X.-C. Yin et al., A multi-stage strategy to perspective rectification
for mobile phone camera-based document images, in Proc. Int. Conf.
Document Anal. Recognit. (ICDAR), Sep. 2007, pp. 574578.
[190] C. L. Zitnick and P. Dollár, Edge boxes: Locating object proposals
from edges, in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), 2014,
pp. 391405.
[191] P. Dollár, R. Appel, S. Belongie, and P. Perona, Fast feature pyramids
for object detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
no. 8, pp. 15321545, Aug. 2014.
[192] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, Recognizing text
with perspective distortion in natural scenes, in Proc. IEEE Int. Conf.
Comput. Vis. (ICCV), Dec. 2013, pp. 569576.
[193] D. Kumar, M. N. A. Prasad, and A. G. Ramakrishnan, Multi-script
robust reading competition in ICDAR 2013, in Proc. Int. Workshop
Camera-Based Document Anal. Recognit. (CBDAR), 2013, Art. no. 14.
[194] U. Pal, Language, script, and font recognition, in Handbook of
Document Image Processing and Recognition. London, U.K.: Springer,
2014, pp. 291330.
[195] P. Shivakumara, Z. Yuan, D. Zhao, T. Lu, and C. L. Tan, New gradient-spatial-structural features for video script identification, Comput. Vis. Image Understand., vol. 130, pp. 35–53, Jan. 2015.

Xu-Cheng Yin (SM'16) received the B.Sc. and M.Sc. degrees in computer science from the University of Science and Technology Beijing, China,
in 1999 and 2002, respectively, and the Ph.D. degree
from the Institute of Automation, Chinese Academy
of Sciences, in 2006. From 2006 to 2008, he was
a Researcher with the Information Technology Laboratory, Fujitsu Research and Development Center.
From 2013 to 2014, he was a Visiting Researcher
with the Center for Intelligent Information Retrieval,
University of Massachusetts Amherst, USA. He is
currently a Professor with the Department of Computer Science and Technology, University of Science and Technology Beijing. He has authored
over 50 research papers (in the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, PLoS ONE, Information Fusion, IJCAI,
SIGIR, CIKM, ICDAR, and ICPR). His research interests include machine
learning, computer vision, information retrieval, and document analysis and
recognition. His team won first place in both Text Localization in Real Scenes and Text Localization in Born-Digital Images in the ICDAR 2013 Robust Reading Competition, and first place in both End-to-End Text Recognition in Real Scenes (Generic) and End-to-End Text Recognition in Born-Digital Images (Generic) in the ICDAR 2015 Robust Reading Competition.

Ze-Yu Zuo received the B.Sc. degree in computer science from the University of Science and Technology Wuhan, Hubei, China, in 2013, and the
M.Sc. degree in computer science from the University of Science and Technology Beijing, China,
in 2016. She is currently an Engineer with
weibo.com, China. Her research interests include
video text tracking, multimedia understanding, and
retrieval.

Shu Tian received the B.Sc. and Ph.D. degrees in computer science from the University of Science and Technology Beijing, China, in 2010 and 2016,
respectively. His research interests include object
tracking, pattern recognition, and multimedia
understanding.

Cheng-Lin Liu (F'14) received the B.S. degree
in electronics engineering from Wuhan University,
China, in 1989, the M.E. degree in electronics engineering from Beijing Polytechnic University, China,
in 1992, and the Ph.D. degree in pattern recognition
and intelligent control from the Chinese Academy
of Sciences, Beijing, China, in 1995. He was a
Post-Doctoral Fellow with the Korea Advanced
Institute of Science and Technology and with the
Tokyo University of Agriculture and Technology
from 1996 to 1999. From 1999 to 2004, he was a
Research Staff Member and later a Senior Researcher with the Central Research
Laboratory, Hitachi, Ltd., Tokyo, Japan. He is currently a Professor with
the National Laboratory of Pattern Recognition, Institute of Automation,
Chinese Academy of Sciences, Beijing, China, and the Director of the National
Laboratory of Pattern Recognition. He has authored over 200 technical papers
at prestigious international journals and conferences. His research interests
include pattern recognition, image processing, neural networks, and machine
learning, especially the applications to character recognition and document
analysis. He is a fellow of the IAPR. He serves on the Editorial Board
of Pattern Recognition, Image and Vision Computing, and the International
Journal on Document Analysis and Recognition.
