I. INTRODUCTION
THE explosive growth of smartphones and online social media has led to the accumulation of large amounts of visual data, in particular, the massive and increasing collections of video on the Internet and social networks. For example, YouTube streamed approximately 100 hours of video per minute worldwide in 2014. These countless videos have triggered research activities in multimedia understanding and video retrieval [3], [4].
In the literature, text has received increasing attention as a key and direct information source in video. As examples,
Fig. 1. Text in video: (a) Layered caption text, (b) embedded caption text, and (c) scene text, where embedded caption text and scene text are more challenging
to detect, track and recognize.
of spatial-temporal information, we extensively review techniques of text tracking, tracking-based detection, and tracking-based recognition. Finally, available datasets, representative
challenges, technological applications and future directions of
video text extraction (especially from scene videos and web
videos) are also described in depth in this paper.
A. Text in Video
Following the method of categorization in [9], text in video is categorized as either caption text or scene text (see examples in Fig. 1). Caption text is also called graphic text [1] or artificial text [13]. Caption text provides good directivity and a high-level overview of the semantic information in captions, subtitles and annotations of the video, while scene text is part of the camera images and is naturally embedded within objects (e.g., trademarks, signboards and buildings) in scenes. Moreover, we classify caption text into two subcategories: layered caption text and embedded caption text. Layered caption text is always printed on a specifically designed background layer (see Fig. 1(a)), while embedded caption text is overlaid and embedded on the frame (see Fig. 1(b)). Generally speaking, scene text and embedded caption text are more challenging to detect, track and recognize, and they are therefore the focus of this survey.
II. A UNIFIED FRAMEWORK FOR VIDEO TEXT DETECTION, TRACKING AND RECOGNITION
Some researchers have presented specific frameworks for
video text extraction. For example, Antani et al. divided video
text extraction into four tasks: detection, localization, segmentation, and recognition [14]. In their system, the tracking
stage provides additional input to the spatial-temporal decision
fusion for improving localization. Jung et al. summarized the
subproblems of a text information extraction system for both
images and video into text detection, localization, tracking,
extraction and enhancement, and recognition [9]. The video
text recognition flowchart designed by Elagouni et al. [15] is similar to that of [9], but adds a correction (postprocessing) step based on natural language processing. In contrast, in this
paper we propose a unified video text extraction framework,
where text detection, tracking, and recognition techniques are
uniformly described and surveyed.
Fig. 2. DETR: A unified framework for text DEtection, Tracking and Recognition in video. This framework uniformly describes detection, tracking, recognition (the three main tasks), and their relations and interactions. The major relations among these main tasks are unified as detection-based-recognition, tracking-based-detection and tracking-based-recognition. The other relations among these tasks are also named as refinement-by-recognition (for detection), refinement-by-recognition (for tracking) and tracking-with-detection.
Fig. 3. A tree-style presentation of the different categories of methods for text tracking, detection and recognition using multiple frames.
Elagouni et al. [15] reduced the ambiguities involved in character segmentation by considering character recognition results
and introducing some linguistic knowledge. Chen et al. [75]
described a multiple-hypotheses framework and proposed a
new gray-scale consistency constraint (GCC) algorithm to
improve character segmentation. Saidane and Garcia [76]
presented a video text-oriented binarization method using a
specific architecture of convolutional neural networks.
Similarly, recognizing text in scene videos has attracted more and more interest from researchers in the fields of document analysis and recognition, computer vision, and machine
learning [77]. Therefore, we also briefly summarize several
representative methods for text recognition in scene images.
As for scene text (cropped word) recognition, the existing methods can be grouped into segmentation-based word
recognition and holistic word recognition (word spotting [78]).
In general, segmentation-based word recognition methods
integrate character segmentation and character recognition
with language priors using optimization techniques, such as
Markov models [79] and CRFs [80], [81]. Given a lexicon of
words, the goal in word spotting is to identify specific words
in scene images without character segmentation [82]. Most
text recognition methods rely on text segmentation (removing
background). Fortunately, most CC-based detection methods
and some region-based methods already output text images
without the background. On the other hand, some recognition
methods use classifiers (such as CNNs) directly on text regions
mixed with the background. However, this approach requires
a large number of training samples of various text fonts and
backgrounds to train the classifier.
In recent years, mainstream segmentation-based word
recognition techniques typically over-segment the word image
into small segments, combine adjacent segments into candidate
characters, classify them using CNNs or gradient feature-based classifiers, and find an approximately optimal word
recognition result using beam search [83], Hidden Markov
Models [84], or dynamic programming [73]. Word spotting methods usually calculate a similarity measure between the lexicon words and the candidate word images.
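To make the segmentation-based pipeline above concrete, the following Python sketch (ours, not taken from any surveyed paper) performs beam search over groupings of over-segmented pieces. The classify callable, assumed to return (character, log-probability) candidates from, e.g., a CNN character classifier, and the equal-height segment images are illustrative assumptions.

```python
import numpy as np

def recognize_word(segments, classify, beam_width=5, max_merge=3):
    """Beam search over groupings of over-segmented pieces.

    segments: list of equal-height grayscale arrays (assumed input).
    classify: callable mapping an image to [(char, log_prob), ...]
              (assumed interface, e.g., top-k CNN outputs)."""
    # Each hypothesis: (next_segment_index, decoded_text, total_log_prob).
    beams, complete = [(0, "", 0.0)], []
    n = len(segments)
    while beams:
        expanded = []
        for idx, text, score in beams:
            if idx == n:                       # all segments consumed
                complete.append((text, score))
                continue
            # Merge 1..max_merge adjacent segments into a candidate character.
            for k in range(1, min(max_merge, n - idx) + 1):
                candidate = np.hstack(segments[idx:idx + k])
                for ch, logp in classify(candidate):
                    expanded.append((idx + k, text + ch, score + logp))
        expanded.sort(key=lambda h: h[2], reverse=True)
        beams = expanded[:beam_width]          # prune to the beam width
    return max(complete, key=lambda h: h[1]) if complete else ("", float("-inf"))
```

A language prior, as in the CRF- or HMM-based methods above, could be folded into the hypothesis score at each expansion step.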
3) Tracking With Tracking-by-Detection: The tracking-by-detection method associates detection results across successive frames to recover the appearances of objects, namely, estimating tracking trajectories from text detection results. Compared to other tracking methods, tracking-by-detection naturally solves the re-initialization problem when the object is accidentally lost in some frames. It also avoids excessive model drift due to the similar appearances of different objects. Here, we discuss tracking-by-detection methods ordered by the features used for region matching (matching regions from text detection), from simple to complex, e.g., location overlap, edge maps, character strokes, Harris corners, and MSERs.
Wolf et al. [119] described a simple matching strategy for
tracking-by-detection that uses the overlap between the lists of text bounding boxes detected in the previous and current frames to associate the same text.
To further reduce false alarms, the length of the appearance
is used as a measure of stability. However, because overlap
information is used, the method is not suitable for handling
text in motion.
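As a minimal sketch of this overlap-based association (in the spirit of [119], not their exact algorithm), the following Python code greedily matches boxes between two frames by intersection-over-union; the box format, threshold, and greedy policy are illustrative assumptions.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(prev_boxes, curr_boxes, thresh=0.5):
    """Greedily match text boxes between two consecutive frames by overlap."""
    matches, used = [], set()
    for i, p in enumerate(prev_boxes):
        best_j, best_o = -1, thresh
        for j, c in enumerate(curr_boxes):
            if j in used:
                continue
            o = iou(p, c)
            if o > best_o:
                best_j, best_o = j, o
        if best_j >= 0:
            matches.append((i, best_j))
            used.add(best_j)
    return matches  # unmatched current boxes would start new tracks
```

Because the association relies purely on spatial overlap, such a scheme shares the limitation noted above: it only works for static or slowly moving text.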
Similarly, Mi et al. [120] used text region location similarity
and edge map similarity to determine the starting and ending
frames of frame sequences containing the same text object.
Region location similarity is measured by the overlap of two text regions in different frames. Only when both similarities are greater than a specified threshold are the text regions considered the same; otherwise, they are considered different text regions.
In addition to overlap information, text polarity and character strokes are also used to verify whether two text blocks in consecutive frames are identical [121]. If the two text blocks possess the same text polarity and character strokes, and one text block overlaps the other to a sufficient extent, then they are considered the same text. The system's performance may be enhanced by making greater use of text polarity and character strokes. However, this method only handles static text effectively.
Petter et al. [20] extended the AR text translator of
Fragoso et al. [108] with automatic text detection by using
connected component analysis of the Canny edge detector outputs. However, these two systems differ in their handling of the
temporal redundancy. The system of Petter et al. [20] detects text in each frame and matches text blocks between the previous and current frames based on the areas and centroids of the detected text blocks. In contrast, the tracking method used by Fragoso et al. [108] is more robust and effective owing to the ESM algorithm.
Zhen and Zhiqiang [122] presented a text tracking method
for static text by fusing detection results. Their method first
extracts the Harris corner features of text and uses them to
search for the corresponding position in the current frame.
Then, the Hausdorff distance is applied to measure the dissimilarity. When the Hausdorff distance between a text block
and the reference text block is less than a given threshold, the
text block is considered to be a tracked (same) text block.
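A minimal sketch of this corner-based matching (in the spirit of [122], not their exact procedure): Harris corners are extracted from the reference and candidate text blocks, and the symmetric Hausdorff distance between the two corner sets decides whether the candidate is the tracked text. OpenCV and SciPy are assumed available; the parameters are illustrative.

```python
import cv2
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def corner_set(gray_block):
    # Harris corner features of a text block (assumed grayscale uint8 array).
    pts = cv2.goodFeaturesToTrack(gray_block, maxCorners=100,
                                  qualityLevel=0.01, minDistance=3,
                                  useHarrisDetector=True)
    return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2))

def is_same_text(ref_block, cand_block, thresh=5.0):
    """Symmetric Hausdorff distance between the two corner sets; below the
    threshold, the candidate is treated as the tracked (same) text block."""
    a, b = corner_set(ref_block), corner_set(cand_block)
    if len(a) == 0 or len(b) == 0:
        return False
    d = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
    return d < thresh
```

The Hausdorff distance is tolerant of a few missing or spurious corners, which makes it a natural dissimilarity measure between sparse feature sets.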
Liu and Wang [123] proposed a robust method for extracting captions in video based on stroke-like edges and spatio-temporal analysis.
Integration-based enhancement mainly includes multi-frame averaging, time-based minimum/maximum pixel value searching, and Boolean AND operations.
To achieve better results, it is necessary to select the most appropriate text regions for character recognition. We call techniques with this simple and basic strategy selection-based methods. In [96], [97], [114], the region with the longest horizontal length is selected as the most appropriate region because characters in this region are usually the biggest. Furthermore, Goto and Tanaka [115] changed the selection algorithm to avoid a delay in message presentation by taking six features into account: text region area and width, Fisher's discriminant ratio, the number of vertical edges, the sum of the absolute values of the vertical components, and the vertical edge intensity. If a feature value is at a local maximum and is the highest among the past peaks in a chain, as calculated every two seconds, the text region is passed to the next process.
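As a minimal sketch of the basic selection-based strategy of [96], [97], [114] (picking the tracked region with the longest horizontal extent), assuming the tracked regions are available as NumPy image arrays:

```python
def select_best_region(tracked_regions):
    """Pick the text region with the longest horizontal length among all
    tracked instances of one text object; characters in that region are
    usually the biggest and thus the easiest to recognize."""
    return max(tracked_regions, key=lambda region: region.shape[1])
```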
Obviously, the selection-based methods have limitations for blurred text. In such situations, the integration-based methods may obtain better results. The most
common strategy is multi-frame averaging, where tracked text
regions with the same ID label are averaged to obtain a new
text region. The integration-based methods filter out complex
local backgrounds and improve text quality, thereby increasing
the recognition accuracy. Multi-frame averaging is used in
many tracking-based text recognition techniques [42], [102],
[106], [107], [119], [122], [124], [142], [143]. Specifically, the
average operation can be applied to produce an average luminance image [42]. Xi et al. [107] used multi-frame averaging
to obtain a new intensity image after tracking a text block over
5 consecutive frames, where each text block is first enlarged
before performing the averaging operation. Phan et al. [102]
first employed a multi-frame averaging technique in mask areas corresponding to the original text regions. Then, simple thresholding is applied to the average intensity image to obtain an initial binarization of the word area. Finally, the shapes of the characters are refined using the intensity values.
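A minimal sketch of multi-frame averaging followed by the simple thresholding step described above (assumptions: the tracker supplies aligned, same-size grayscale crops sharing one track ID; the optional bilinear upscaling anticipates the interpolation-based variants discussed below):

```python
import cv2
import numpy as np

def enhance_tracked_text(regions, scale=2):
    """Enhance one tracked text object: bilinearly upscale each same-size
    grayscale crop, then average the interpolated crops (multi-frame
    averaging) to suppress the changing local background."""
    h, w = regions[0].shape[:2]
    upscaled = [cv2.resize(r, (w * scale, h * scale),
                           interpolation=cv2.INTER_LINEAR) for r in regions]
    avg = np.mean([u.astype(np.float32) for u in upscaled], axis=0)
    return avg.astype(np.uint8)

def initial_binarization(avg, thresh=128):
    # Simple global thresholding of the averaged intensity image, as in
    # the initial binarization step described above.
    return ((avg > thresh) * 255).astype(np.uint8)
```

Averaging works because the text pixels stay (approximately) fixed across the aligned crops while the background varies, so the background is smoothed away and the text contrast is preserved.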
Similarly, Hua et al. [143] first selected high-contrast frames (HCFs) by calculating the percentage of dark pixels around text blocks, and applied averaging to them. To handle cases where only part of the text is readable or clear in the HCFs, a text area decomposition method [144] was used to divide a text block into words on the averaged frame of the HCFs. They then searched for high-contrast text blocks (HCBs) and averaged all the corresponding HCBs to obtain a clearer block. Finally, these clearer blocks are merged to form a whole text region.
Some other multi-frame averaging methods for image
enhancement exist. For example, a combination of image
interpolation and multi-frame averaging can be applied to
enhance the image. By first using bilinear interpolation to
improve the resolution of each text region, a new text region
can be obtained by averaging 30 text regions [106] or all the
interpolated images [119]. In Wang and Wei's method [122],
a text region is first segmented into smaller blocks. The
corresponding clearer blocks are selected by comparing the
TABLE I
SUMMARY OF TEXT TRACKING, TRACKING-BASED DETECTION AND TRACKING-BASED RECOGNITION METHODS, WITH THE TEXT CATEGORY THEY APPLY TO: CAPTION TEXT AND/OR SCENE TEXT
A. Evaluation Protocols
The evaluation protocols for text detection, tracking and recognition in video have been presented in [21], [42], [89], [101], [121], [124], [130], [133], and [150]–[154]. Here, we summarize several mainstream evaluation methods.
Detection Evaluation Protocols: There are three criteria for measuring text detection results in video. The first criterion is speed, which indicates the average processing time per frame for text detection. The second criterion is precision, which evaluates the percentage of correctly detected text regions among all detected text regions:

$$\mathrm{Precision} = \frac{\text{number of correctly detected text regions}}{\text{number of detected text regions}}. \tag{1}$$

The third criterion is recall, which evaluates the percentage of correctly detected text regions among the ground-truth text regions:

$$\mathrm{Recall} = \frac{\text{number of correctly detected text regions}}{\text{number of ground-truth text regions}}. \tag{2}$$
The MOTP (Multiple Object Tracking Precision) is defined as

$$\mathrm{MOTP} = \frac{\displaystyle\sum_{t=1}^{N_{frames}} \sum_{i=1}^{N_{mapped}^{(t)}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{\displaystyle\sum_{t=1}^{N_{frames}} N_{mapped}^{(t)}}. \tag{3}$$

The MOTA (Multiple Object Tracking Accuracy) is defined as

$$\mathrm{MOTA} = 1 - \frac{\displaystyle\sum_{t=1}^{N_{frames}} \left(fn_t + fp_t + id\_sw_t\right)}{\displaystyle\sum_{t=1}^{N_{frames}} N_G^{(t)}}, \tag{4}$$

where $G_i^{(t)}$ and $D_i^{(t)}$ denote the $i$-th mapped ground-truth and detected text objects in frame $t$, $N_{mapped}^{(t)}$ is the number of mapped ground-truth/detection pairs in frame $t$, $N_G^{(t)}$ is the number of ground-truth objects in frame $t$, and $fn_t$, $fp_t$ and $id\_sw_t$ are the numbers of false negatives, false positives and identity switches in frame $t$, respectively.

The ATA is the normalized STDA per text and is defined as

$$\mathrm{ATA} = \frac{\mathrm{STDA}}{\left(N_G + N_D\right)/2}, \tag{6}$$

where $N_G$ and $N_D$ are the numbers of ground-truth and detected text objects in the sequence, respectively.
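As a worked illustration of Eqs. (3) and (4), the following Python sketch computes MOTP and MOTA from per-frame statistics. The upstream matching of ground truth to detections (which produces the overlap ratios, false positives/negatives, and identity switches) is assumed to be done already; the dictionary layout is our own convention.

```python
def mot_metrics(frames):
    """Compute MOTP (Eq. 3) and MOTA (Eq. 4) from per-frame statistics.

    Each element of `frames` is a dict with:
      'overlaps': |G∩D|/|G∪D| for every mapped ground-truth/detection pair,
      'fn', 'fp', 'id_sw': false negatives, false positives, ID switches,
      'n_gt': number of ground-truth text objects in the frame."""
    overlap_sum = sum(sum(f['overlaps']) for f in frames)
    n_mapped = sum(len(f['overlaps']) for f in frames)
    errors = sum(f['fn'] + f['fp'] + f['id_sw'] for f in frames)
    n_gt = sum(f['n_gt'] for f in frames)
    motp = overlap_sum / n_mapped if n_mapped else 0.0
    mota = 1.0 - errors / n_gt if n_gt else 1.0
    return motp, mota
```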
TABLE II
BENCHMARK DATASETS AVAILABLE FOR VIDEO TEXT EXTRACTION, WHERE IN THE TASK COLUMN D, T, S AND R RESPECTIVELY REPRESENT DETECTION, TRACKING, SEGMENTATION AND RECOGNITION
Fig. 4. Sample frames of text from the MSRA-I, MoCA, Merino-Gracia, Minetto, and ICDAR13/ICDAR15 datasets. (a) MSRA-I Embedded Caption Text
(Graphic Text) Dataset. (b) MoCA Embedded Caption Text (Graphic Text) Dataset. (c) Merino-Gracia Scene Text Dataset. (d) Minetto Scene Text Dataset.
(e) ICDAR13 / ICDAR15 Scene Text Dataset.
TABLE III
STATE-OF-THE-ART PERFORMANCE ON AVAILABLE VIDEO TEXT DATASETS, WHERE PRECISION AND RECALL ARE FOR TEXT DETECTION; MOTP, MOTA AND ATA ARE FOR TEXT TRACKING; AND D AND R MEAN TRACKING IS APPLIED FOR MEASURING DETECTION AND RECOGNITION, RESPECTIVELY. THE INTERVAL VALUE IS ESTIMATED FROM THE FIGURE IN THE CORRESPONDING REFERENCE
TABLE IV
CHALLENGES OF TEXT DETECTION, TRACKING AND RECOGNITION FOR SCENE TEXT AND EMBEDDED CAPTION TEXT IN VIDEO
Assisting Visually Impaired People: Real-time text extraction technology in the wild can help visually impaired
people understand scenes in their surrounding environment.
A handheld PDA-based system was developed to help blind
people accomplish daily tasks [167]. The system can be
viewed as a loop that includes the user taking a snapshot,
text/picture detection, optical character recognition, text-to-speech synthesis, and feedback to the user, until it reaches a
useful output. Ezaki et al. [168] described a system equipped
with a PDA, a CCD-camera and a voice synthesizer to
assist visually impaired persons by detecting text objects
from natural scenes and transforming them into voice signals.
Similarly, a guide dog system [165], a portable text reading
system [169], an autonomous robot [96], [97] and a wearable
camera system [114], [115] were all designed and constructed
for visually impaired people.
Real-Time Translation: Text extraction is also important for translation purposes (e.g., for tourists or robots).
Haritaoglu [171] developed an automatic sign/text language
translation system for foreign travelers. Detected text in a scene can be translated into a traveler's native language by this system. Shi and Xu [172] presented a wearable translation
robot that can automatically translate multiple languages in
real time. The robot consists of a camera mounted on reading
glasses together with a head-mounted display used as the
output device for the translated text. A mobile augmented
reality (AR) translator on a mobile phone using a smart-phone
camera and touchscreen is described in [20] and [108].
User Navigation: User navigation can pinpoint a user's position and provide routes to a destination in real time.
Aoki et al. [173] designed a small camera mounted on a baseball cap intended for user navigation in a scenic environment.
A street view navigation system was also constructed in [118].
Traffic Monitoring and Driving Assistance Systems: In general, the most effective way to monitor traffic is to obtain
and track license plates. Cui and Huang [174] proposed
a Markov Random Field (MRF) model-based method to
extract characters from the license plates of moving vehicles.
Park et al. [175] used neural networks to identify car
[104] T. Yusufu, Y. Wang, and X. Fang, "A video text detection and tracking system," in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2013, pp. 522–529.
[105] H. Li and D. Doermann, "Automatic text tracking in digital videos," in Proc. IEEE 2nd Workshop Multimedia Signal Process., Dec. 1998, pp. 21–26.
[106] H. Li and D. Doermann, "Text enhancement in digital video using multiple frame integration," in Proc. 7th ACM Int. Conf. Multimedia (ACM MM), 1999, pp. 19–22.
[107] J. Xi, X.-S. Hua, X.-R. Chen, L. Wenyin, and H.-J. Zhang, "A video text detection and recognition system," in Proc. ICME, Aug. 2001, pp. 873–876.
[108] V. Fragoso, S. Gauglitz, S. Zamora, J. Kleban, and M. Turk, "TranslatAR: A mobile augmented reality translator," in Proc. IEEE Workshop Appl. Comput. Vis. (WACV), Jan. 2011, pp. 497–502.
[109] S. Benhimane and E. Malis, "Real-time image-based tracking of planes using efficient second-order minimization," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), vol. 1, Sep./Oct. 2004, pp. 943–948.
[110] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[111] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. VISAPP, vol. 1, 2009, pp. 331–340.
[112] L. Gomez and D. Karatzas, "MSER-based real-time text detection and tracking," in Proc. 22nd Int. Conf. Pattern Recognit. (ICPR), Aug. 2014, pp. 3110–3115.
[113] M. Donoser and H. Bischof, "Efficient maximally stable extremal region (MSER) tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2006, pp. 553–560.
[114] M. Tanaka and H. Goto, "Text-tracking wearable camera system for visually-impaired people," in Proc. 19th Int. Conf. Pattern Recognit. (ICPR), Dec. 2008, pp. 1–4.
[115] H. Goto and M. Tanaka, "Text-tracking wearable camera system for the blind," in Proc. 10th Int. Conf. Document Anal. Recognit. (ICDAR), Jul. 2009, pp. 141–145.
[116] C. Merino and M. Mirmehdi, "A framework towards realtime detection and tracking of text," in Proc. 2nd Int. Workshop Camera-Based Document Anal. Recognit. (CBDAR), 2007, pp. 10–17.
[117] C. Merino-Gracia, K. Lenc, and M. Mirmehdi, "A head-mounted device for recognizing text in natural scenes," in Proc. 4th Int. Workshop Camera-Based Document Anal. Recognit., 2012, pp. 29–41.
[118] R. Minetto, N. Thome, M. Cord, N. J. Leite, and J. Stolfi, "Snoopertrack: Text detection and tracking for outdoor videos," in Proc. 18th IEEE Int. Conf. Image Process. (ICIP), Sep. 2011, pp. 505–508.
[119] C. Wolf, J. M. Jolion, and F. Chassaing, "Text localization, enhancement and binarization in multimedia documents," in Proc. 16th Int. Conf. Pattern Recognit., vol. 2, 2002, pp. 1037–1040.
[120] C. Mi, Y. Xu, H. Lu, and X. Xue, "A novel video text extraction approach based on multiple frames," in Proc. 5th Int. Conf. Inf., Commun. Signal Process., 2005, pp. 678–682.
[121] J. Zhou, L. Xu, B. Xiao, R. Dai, and S. Si, "A robust system for text extraction in video," in Proc. Int. Conf. Mach. Vis. (ICMV), Dec. 2007, pp. 119–124.
[122] W. Zhen and W. Zhiqiang, "An efficient video text recognition system," in Proc. 2nd Int. Conf. Intell. Human-Mach. Syst. Cybern. (IHMSC), vol. 1, Aug. 2010, pp. 174–177.
[123] X. Liu and W. Wang, "Robustly extracting captions in videos based on stroke-like edges and spatio-temporal analysis," IEEE Trans. Multimedia, vol. 14, no. 2, pp. 482–489, Apr. 2012.
[124] B. Wang, C. Liu, and X. Ding, "A research on video text tracking and recognition," Proc. SPIE, vol. 8664, p. 86640G, Mar. 2013.
[125] X. Rong, C. Yi, X. Yang, and Y. Tian, "Scene text recognition in multiple frames based on text tracking," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2014, pp. 1–6.
[126] Y. Nakajima, A. Yoneyama, H. Yanagihara, and M. Sugano, "Moving-object detection from MPEG coded data," Proc. SPIE, vol. 3309, pp. 988–996, Jan. 1998.
[127] M. Pilu, "Using raw MPEG motion vectors to determine global camera motion," Proc. SPIE, vol. 3309, pp. 448–459, Jan. 1998.
[128] S. Antani, D. Crandall, and R. Kasturi, "Robust extraction of text in video," in Proc. Int. Conf. Pattern Recognit., Sep. 2000, pp. 1831–1834.
[129] U. Gargi, D. Crandall, S. Antani, T. Gandhi, R. Keener, and R. Kasturi, "A system for automatic text detection in video," in Proc. 12th Int. Conf. Document Anal. Recognit., Sep. 1999, pp. 29–32.