
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 2, Jun 2013, 107-112 © TJPRC Pvt. Ltd.

DETECTION AND EXTRACTION OF TEXT IN VIDEO - A SURVEY


MEGHA KHANDELWAL, BHUPENDRA KUMAR & TUSHAR PATNAIK Centre for Development of Advanced Computing, Noida, Uttar Pradesh, India


Detection and extraction of text present in video is in great demand for video retrieval and indexing, and it has become a very popular research area in the recent past. Factors such as complex backgrounds, color bleeding, noise, low resolution and low contrast make the automatic detection and extraction of text in video frames a challenging task. In view of the growing popularity of, and the recent developments in, the processing of text in video frames, this paper presents a survey of the techniques proposed for effective detection, localization, enhancement and extraction of text in video. The techniques for text detection are broadly categorized as heuristic, machine learning and hybrid.

KEYWORDS: Detection and Extraction, Heuristic Method, Machine Learning Method and Hybrid Method


Text in video contains useful information about the content of the video, which can be used to build a content-based retrieval system for any large digital video database. The main purpose of a video text detection and extraction system is to gather the text present in the video so that it can be fed to an OCR system for recognition.

The text in video frames can be broadly classified into two categories: superimposed (caption, artificial, or overlay) text and scene text. Scene text is text captured within the scene; it is an integral part of the frame. Examples include street signs and billboards. The appearance of such text is incidental to the scene content and is of little importance except in applications such as navigation, surveillance or reading text on unknown objects. Scene text can appear on any kind of surface, in any orientation, and at any size and perspective, which makes its extraction particularly difficult. Superimposed text, in contrast, is artificially added to video frames in order to provide information about the content of the scene. Examples of superimposed text include headlines, keyword summaries, time and location stamps, names of people and scores. Superimposed text has a distinctive texture that makes it distinguishable from the background, its colors are chosen to have high contrast with the background, and its characters generally have uniform color and intensity, all of which help in its detection. Figure 1 shows a typical video text detection and extraction system:


Figure 1: Structure of the System



Text detection and localization finds and defines the actual location of the text present in the video frame by forming bounding boxes around the text. After the bounding boxes are found, the text lines are enhanced, extracted and binarized for the OCR system. Section 2 of this paper discusses the approaches used by researchers for detection and localization of text in video frames. Section 3 discusses techniques for text enhancement and extraction. Section 4 explains the method for evaluating the performance of the various techniques, and Section 5 presents the conclusion of the paper.


The features mainly used for text detection in video are edge based and texture based. Edge based methods use the rich edge information of text regions for fast text detection, whereas texture based methods score candidate regions of variable scale with a binary classifier applied to extracted textural features (such as Gabor filter, wavelet transform, gradient orientation and local binary pattern (LBP) features). With edge based features, the performance relies on the extraction of text edges, which are often contaminated by the background image. Text detection methods are broadly categorized as heuristic methods, machine learning methods and hybrid methods.

Heuristic Method

Heuristic methods use empirical rules and thresholds to distinguish text from non-text areas. These techniques are efficient and robust for frames with high-contrast text and a relatively smooth background. Shivakumara et al. [1] proposed a method that applies heuristic rules based on a combination of filters and edge analysis to identify candidate text blocks in the frame. A 256 x 256 pixel frame is segmented into 16 non-overlapping blocks. Rules using an arithmetic filter, a median filter and edge analysis identify the candidate text blocks. A complete text block is obtained using a block growing method. Finally, the true text regions are detected based on vertical and horizontal bar features. Straightness and cursiveness properties are used to eliminate edges.
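The block segmentation step of [1] can be illustrated with a minimal sketch. The text only states that a 256 x 256 frame is divided into 16 non-overlapping blocks; the 4 x 4 grid of 64 x 64 blocks and the function name `split_into_blocks` are assumptions made for illustration, and the filter and edge rules applied to each block are not reproduced.

```python
def split_into_blocks(frame, blocks_per_side=4):
    """Split a square 2-D frame (list of rows) into non-overlapping blocks.

    Assumption: the 16 blocks form a regular 4 x 4 grid.
    Blocks are returned in row-major order.
    """
    size = len(frame)
    if size % blocks_per_side != 0 or any(len(row) != size for row in frame):
        raise ValueError("frame must be square and divisible by blocks_per_side")
    step = size // blocks_per_side
    blocks = []
    for top in range(0, size, step):
        for left in range(0, size, step):
            blocks.append([row[left:left + step] for row in frame[top:top + step]])
    return blocks


# Synthetic 256 x 256 grayscale frame used only to exercise the function.
frame = [[(x + y) % 256 for x in range(256)] for y in range(256)]
blocks = split_into_blocks(frame)
print(len(blocks), len(blocks[0]), len(blocks[0][0]))  # 16 64 64
```

Each block can then be tested independently by the candidate-text rules, which keeps the per-block cost small.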

Shivakumara et al. [2] also proposed an efficient text detection method based on the Laplacian operator. The input frame is filtered using a 3x3 Laplacian mask. The maximum gradient difference (MGD) captures the relationship between the positive and negative values of the Laplacian-filtered frame. All pixels are classified into text and non-text clusters using k-means based on the Euclidean distance between MGD values. Boundary refinement is done using the Sobel edge map of the text clusters, followed by thresholding of the horizontal and vertical projection profiles of the frame. Finally, geometrical properties such as width, height, area, aspect ratio and edge area are used to eliminate false alarms.

Poignant et al. [3] proposed an approach in which a Sobel filter is applied first, followed by dilation and erosion in order to connect the characters together. Then, corner detection is performed on each connected component, and connected components that do not satisfy the mandatory geometry are filtered out. Detection is performed on each frame, and the boxes that are sufficiently stable over time are kept. The resulting boxes are binarized using the Otsu algorithm.
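The first two steps of the Laplacian method can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [2]: the 8-neighbour mask coefficients, the window width `n`, and the definition of the gradient difference as the maximum minus the minimum filtered value in a 1 x n horizontal window are all assumed here, and the k-means clustering step is omitted.

```python
# Assumed 3x3 Laplacian mask (8-neighbour variant); the paper's mask may differ.
LAPLACIAN_3X3 = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]


def laplacian_filter(gray):
    """Apply the 3x3 mask to interior pixels; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(
                LAPLACIAN_3X3[j][i] * gray[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
    return out


def gradient_difference(filtered, n=5):
    """Max minus min of the filtered values in a 1 x n window around each pixel."""
    half = n // 2
    h, w = len(filtered), len(filtered[0])
    mgd = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = filtered[y][max(0, x - half):x + half + 1]
            mgd[y][x] = max(window) - min(window)
    return mgd


# A thin bright vertical stroke on a dark background.
gray = [[10 if x == 3 else 0 for x in range(7)] for _ in range(5)]
filtered = laplacian_filter(gray)
mgd = gradient_difference(filtered, n=5)
print(filtered[2])  # [0, 0, 30, -60, 30, 0, 0]
print(mgd[2][3])    # 90
```

The stroke produces a negative response flanked by positive ones, so the gradient difference is large near it and zero in flat regions, which is what makes it usable as a clustering feature.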

Min Cai et al. [4] proposed a scheme using features such as edge strength, edge density and horizontal distribution. First, the Sobel operator is used to detect edges in YUV space, and the result is globally thresholded to filter out points that are definitely not edges. After that, a selective local threshold is applied to simplify the background. To further highlight the text area by its edge density, an edge strength smoothing operator and an edge clustering power operator are used. Finally, rules on average density, peak distribution in the fine vertical projection and density distribution eliminate non-text regions.

Xi et al. [5] proposed a heuristic approach in which an edge map is obtained by applying Sobel operators to the input frame, followed by edge thinning and de-noising; the edge map is then binarized. The edge map is then smoothed, and a morphological opening removes some noise areas. Region decomposition is then performed in both horizontal and vertical orientations. Finally, certain rules are applied for text block identification.
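Both [4] and [5] start from a Sobel edge map that is then thresholded. A minimal sketch of that common first step is given below for a single grayscale channel; the function name `sobel_edge_map` and the threshold value are illustrative assumptions, and the later steps (local thresholding, thinning, smoothing, rules) are not reproduced.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edge_map(gray, threshold):
    """Binary edge map: 1 where the Sobel gradient magnitude >= threshold.

    Border pixels are left at 0 because the 3x3 mask does not fit there.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    p = gray[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * p
                    gy += SOBEL_Y[j][i] * p
            if math.hypot(gx, gy) >= threshold:
                edges[y][x] = 1
    return edges


# Vertical step edge: dark left half, bright right half.
gray = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
edges = sobel_edge_map(gray, threshold=200)  # threshold chosen for this toy input
print(edges[2])  # [0, 0, 1, 1, 0, 0]
```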

Machine Learning

Machine learning methods use classifiers trained on text and non-text patterns; the trained classifier scans the video frame in order to localize the text occurrences.

Xiaojun Li et al. [11] proposed a method in which stroke maps are generated using stroke filters in the horizontal, vertical, left diagonal and right diagonal directions. A 24-dimensional feature is then extracted for each sliding window, and an SVM is used to obtain the candidate text blocks, which are further refined through a group of rules. Thereafter, projection profiles are used to obtain candidate text lines. Finally, another SVM classifier based on 6-dimensional features verifies the candidate text lines.

Wernicke et al. [6] proposed a robust multi-resolution approach in which the directional as well as the overall edge strength images are used as feature images. A fixed-scale text detector then classifies each pixel in the overall edge strength image based on its neighborhood. A feed-forward neural network is used as the classifier. The network is trained with the bootstrap method and requires 7 training cycles. A salience map is created by projecting the confidence of a box being text back to the original scale of the frame. In the next step, text bounding boxes are extracted by an initial bounding box extraction followed by a revised bounding box extraction, and finally the text and color features are used.
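The sliding-window scanning shared by these methods can be sketched as below. This is only an illustration of the scanning loop: the single edge-density feature and the thresholding `classify` callable are hypothetical stand-ins for the 24-dimensional stroke features and the trained SVM of [11], which are not described in enough detail here to reproduce.

```python
def sliding_windows(height, width, win_h, win_w, step):
    """Yield the (top, left) corner of every window that fits in the frame."""
    for top in range(0, height - win_h + 1, step):
        for left in range(0, width - win_w + 1, step):
            yield top, left


def edge_density(edge_map, top, left, win_h, win_w):
    """Fraction of edge pixels inside the window (stand-in feature)."""
    total = sum(sum(row[left:left + win_w]) for row in edge_map[top:top + win_h])
    return total / (win_h * win_w)


def candidate_windows(edge_map, win_h, win_w, step, classify):
    """Return the windows that the classifier labels as text."""
    h, w = len(edge_map), len(edge_map[0])
    return [
        (top, left)
        for top, left in sliding_windows(h, w, win_h, win_w, step)
        if classify(edge_density(edge_map, top, left, win_h, win_w))
    ]


# Toy binary edge map: the right half is densely covered with edges.
edge_map = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(4)]
hits = candidate_windows(edge_map, win_h=4, win_w=4, step=4,
                         classify=lambda density: density > 0.5)
print(hits)  # [(0, 4)]
```

In a real system the `classify` callable would wrap a trained model, and the scan would be repeated at several scales to handle text of different sizes.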

Hybrid Approach

Hybrid methods usually consist of two stages. The first localizes text with a fast heuristic technique while the second verifies the previous results eliminating some boxes as false alarms.

Anthimopoulos et al. [7] proposed an approach consisting of two stages. In the first stage, morphological operations are performed on the edge map of the frame, generated using the Canny edge detector, and bounding boxes are then determined. After that, edge projection analysis splits each text area into text lines. This algorithm is applied at different resolutions to detect text of varying size. In the second stage, the result is verified using the edge local binary pattern (eLBP), a modified LBP operator that describes the local edge patterns in a frame. Finally, the bounding boxes are extracted from a saliency map generated from the response of the classifier, and a region growing algorithm is applied to produce the final result.
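For orientation, the basic 8-neighbour LBP operator on which eLBP is built can be sketched as follows. This is the plain LBP code, not the eLBP variant of [7], whose modification is not detailed in this survey; the bit ordering (clockwise from the top-left neighbour) is an arbitrary choice made for the example.

```python
def lbp_code(gray, y, x):
    """Basic 8-neighbour LBP code of the interior pixel at (y, x).

    Each neighbour contributes one bit: 1 if its intensity is greater than
    or equal to the centre, 0 otherwise.
    """
    center = gray[y][x]
    # Clockwise, starting from the top-left neighbour (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if gray[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


# A horizontal edge: bright row above a darker centre.
patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
print(lbp_code(patch, 1, 1))  # 7

# A flat patch sets every bit.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(lbp_code(flat, 1, 1))  # 255
```

A histogram of such codes over a candidate box gives a compact texture descriptor that a classifier can use for verification.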

Ye et al. [8] proposed an edge based method followed by a texture based method for text detection. In this method, a color edge detection algorithm based on the Sobel operator is applied, with the edge detection threshold determined by an entropic method. Morphological operations are then applied, and the projection profile is used to separate blocks of multiple text lines into single text lines. Certain rules are then applied to obtain the candidate text. Finally, the candidate text is verified using wavelet features and an SVM classifier: the frame is decomposed with the Daubechies wavelet, and features are extracted from the decomposed frame. The SVM classification model is trained on the resulting feature vectors, and a bootstrap process is used to improve the performance of the classifier.
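The use of a projection profile to separate text lines, which appears in several of the surveyed methods, can be sketched as follows. The function names and the `min_count` threshold are illustrative assumptions; the surveyed papers apply their own thresholds and additional rules.

```python
def horizontal_projection(edge_map):
    """Number of edge pixels in each row of a binary edge map."""
    return [sum(row) for row in edge_map]


def split_text_lines(edge_map, min_count=1):
    """Return (first_row, last_row) for each run of rows with enough edges."""
    profile = horizontal_projection(edge_map)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_count and start is None:
            start = y                      # a new text line begins
        elif count < min_count and start is not None:
            lines.append((start, y - 1))   # the current text line ends
            start = None
    if start is not None:                  # a line touching the bottom border
        lines.append((start, len(profile) - 1))
    return lines


# Two text lines separated by an empty row.
edge_map = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
text_lines = split_text_lines(edge_map)
print(text_lines)  # [(1, 2), (4, 4)]
```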

Shivakumara et al. [9] proposed a scheme for enhancing the input to the Laplacian method in [2] using wavelet decomposition and color features. A (Haar) wavelet decomposition is applied to each of the R, G and B color bands to obtain the high frequency sub-bands. The averages of the high frequency sub-bands of the R, G and B bands, denoted by R-avg, G-avg and B-avg, are computed. Further, the average of R-avg, G-avg and B-avg (AoA) is computed. The Laplacian method proposed in [2] is then applied to the AoA for text detection.
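The AoA computation can be sketched as below under several assumptions: a single-level Haar decomposition on 2 x 2 pixel groups, an unnormalized (averaging) form of the Haar filters, and a plain mean of the three high frequency sub-bands. The function names are hypothetical, and the normalization used in [9] may differ.

```python
def haar_high_bands(channel):
    """One-level 2-D Haar detail sub-bands of a channel with even dimensions.

    Returns three half-size sub-bands: row difference, column difference and
    diagonal difference. Coefficients are scaled by 1/4 (assumed scaling).
    """
    h, w = len(channel) // 2, len(channel[0]) // 2
    row_diff = [[0.0] * w for _ in range(h)]
    col_diff = [[0.0] * w for _ in range(h)]
    diag_diff = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a, b = channel[2 * y][2 * x], channel[2 * y][2 * x + 1]
            c, d = channel[2 * y + 1][2 * x], channel[2 * y + 1][2 * x + 1]
            row_diff[y][x] = (a + b - c - d) / 4
            col_diff[y][x] = (a - b + c - d) / 4
            diag_diff[y][x] = (a - b - c + d) / 4
    return row_diff, col_diff, diag_diff


def mean_of(images):
    """Pixel-wise mean of equally sized 2-D images."""
    h, w = len(images[0]), len(images[0][0])
    return [[sum(img[y][x] for img in images) / len(images) for x in range(w)]
            for y in range(h)]


def average_of_averages(r, g, b):
    """AoA: mean over R, G, B of each band's mean high frequency sub-band."""
    return mean_of([mean_of(haar_high_bands(ch)) for ch in (r, g, b)])


# Toy 2 x 2 colour frame: one bright pixel in R and B, flat G.
r = [[4, 0], [0, 0]]
g = [[0, 0], [0, 0]]
b = [[4, 0], [0, 0]]
aoa = average_of_averages(r, g, b)
print(aoa)  # a single coefficient, approximately 0.667
```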




Hua et al. [10] proposed a text enhancement and tracking method using multi-frame averaging. Four methods, namely multiple frame verification (MFV), high contrast frame averaging, high contrast block averaging and block adaptive threshold (BAT), are used to deal with the issue of unclear text. To reduce false detections, MFV regards only the textboxes that appear in several consecutive frames as true textboxes, and textboxes whose locations are almost the same in consecutive frames are regarded as the same text string. In the next step, only those frames of the frame set in which the text has high contrast with its background are averaged; this deals with situations in which the background is too complex or the contrast is low. When only parts of the text are clear or readable in a given frame, the clear blocks are segmented from the frames and averaged, and the averaged results are merged to form a clear textbox. In BAT, an adaptive threshold based method is applied to each word box separately.

In [5], SSD (sum of squared differences) based block matching is also used for text enhancement and tracking. For each detected or tracked block in the previous frame, its corresponding position in the current frame is searched over a search window W. The MSE (mean square error) is used to measure the dissimilarity. Applying this method gave a significant reduction in false alarms. Wernicke et al. [6] used a signature based text-line search method for text tracking, in which the vertical and horizontal projection profiles of a text line are taken as its reference signature, and the region of the same dimensions that best matches the reference signature is searched for in the following frames.
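The block matching used for tracking in [5] can be sketched as follows. This is a minimal illustration: the search window W is modelled as a square of `radius` pixels around the previous position, which is an assumption, and the function names are hypothetical.

```python
def ssd(frame, top, left, block):
    """Sum of squared differences between a block and a frame region."""
    return sum(
        (frame[top + j][left + i] - block[j][i]) ** 2
        for j in range(len(block))
        for i in range(len(block[0]))
    )


def track_block(block, prev_pos, frame, radius):
    """Find the position within the search window that minimizes the MSE.

    prev_pos is the (top, left) position of the block in the previous frame.
    Returns ((top, left), mse) of the best match in the current frame.
    """
    bh, bw = len(block), len(block[0])
    h, w = len(frame), len(frame[0])
    best_pos, best_mse = None, float("inf")
    for top in range(max(0, prev_pos[0] - radius),
                     min(h - bh, prev_pos[0] + radius) + 1):
        for left in range(max(0, prev_pos[1] - radius),
                          min(w - bw, prev_pos[1] + radius) + 1):
            mse = ssd(frame, top, left, block) / (bh * bw)
            if mse < best_mse:
                best_pos, best_mse = (top, left), mse
    return best_pos, best_mse


# A 2 x 2 bright block that moved from (1, 2) to (2, 3) in the current frame.
block = [[9, 9], [9, 9]]
frame = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        frame[y][x] = 9
position, error = track_block(block, prev_pos=(1, 2), frame=frame, radius=2)
print(position, error)  # (2, 3) 0.0
```

A tracked block whose best MSE stays high over the whole window can be treated as lost, which is one way such a scheme helps to reject false alarms.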


The categories for each detected block by text detection method are defined as follows:

Truly Detected Block (TDB): A detected block that contains a text line, partially or fully.

False Detected Block (FDB): A detected block that does not contain text.

Text Block with Missing Data (MDB): A detected block that misses some characters of a text line (MDB is a subset of TDB).

The performance measures are defined as follows:

Detection Rate (DR) = TDB / ADB

False Positive Rate (FPR) = FDB / (TDB + FDB)

Misdetection Rate (MDR) = MDB / TDB
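The three measures can be computed directly from the block counts, as in the sketch below. ADB is not defined in the text above; the sketch assumes that it denotes the actual number of text blocks in the ground truth. The counts in the example are hypothetical and are not results from any of the surveyed papers.

```python
def detection_metrics(tdb, fdb, mdb, adb):
    """Compute DR, FPR and MDR from block counts.

    tdb: truly detected blocks, fdb: falsely detected blocks,
    mdb: detected blocks with missing data (a subset of tdb),
    adb: assumed to be the actual number of text blocks (ground truth).
    """
    if adb <= 0 or tdb <= 0:
        raise ValueError("adb and tdb must be positive")
    if mdb > tdb:
        raise ValueError("mdb cannot exceed tdb")
    return {
        "DR": tdb / adb,
        "FPR": fdb / (tdb + fdb),
        "MDR": mdb / tdb,
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tdb=80, fdb=20, mdb=8, adb=100)
print(metrics)  # {'DR': 0.8, 'FPR': 0.2, 'MDR': 0.1}
```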

The results derived by the researchers using various techniques are discussed in Table 1:

Table 1: Comparison Table





Here we have broadly discussed the various approaches available for text detection in video frames. The major challenges faced during text detection are low contrast of the text with respect to its background, low resolution of the video frame, unknown color, size and orientation of the text, and color bleeding due to lossy video compression. Among the techniques discussed above, the heuristic approach provides fast text detection, but it depends on geometrical constraints derived from text characteristics. Many machine learning approaches have been proposed in recent years for the detection of text areas with great success, but they involve high computational complexity. The hybrid approach applies a very fast heuristic method followed by a machine learning method in order to refine the results.


1. Palaiahnakote Shivakumara, Trung Quy Phan and Chew Lim Tan, “Video text detection based on filters and edge features”, IEEE International Conference on Multimedia and Expo, pp 514-517, 2002.

2. Trung Quy Phan, Palaiahnakote Shivakumara and Chew Lim Tan, “A Laplacian method for video text detection”, 10th IEEE International Conference on Document Analysis and Recognition, 2009.

3. Johann Poignant, Franck Thollard, Georges Quenot and Laurent Besacier, “Text detection and recognition for person identification in videos”, 9th IEEE International Workshop on Content-Based Multimedia Indexing (CBMI), pp 245-248, 2011.

4. Min Cai, Jiqiang Song and Michael R. Lyu, “A new approach for video text detection”, IEEE International Conference on Image Processing, 2002.

5. Jie Xi, Xian-Sheng Hua, Xiang-Rong Chen, Liu Wenyin and Hong-Jiang Zhang, “A video text detection and recognition system”, IEEE International Conference on Multimedia and Expo, pp 876-876, 2001.

6. Axel Wernicke and Rainer Lienhart, “On the segmentation of text in videos”, IEEE International Conference on Multimedia and Expo, Vol. 3, pp 1511-1514, 2000.

7. Marios Anthimopoulos, Basilis Gatos and Ioannis Pratikakis, “A hybrid system for text detection in video frames”, IEEE, The Eighth IAPR International Workshop on Document Analysis Systems, pp 286-292, 2008.

8. Qixiang Ye, Wen Gao, Weiqiang Wang and Wei Zeng, “A robust text detection algorithm in images and video frames”, Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and Fourth Pacific Rim Conference on Multimedia, Vol. 2, pp 802-806, 2003.

9. Palaiahnakote Shivakumara, Trung Quy Phan and Chew Lim Tan, “New wavelet and color features for text detection in video”, 20th IEEE International Conference on Pattern Recognition (ICPR), pp 3996-3999, 2010.

10. Xian-Sheng Hua, Pei Yin and Hong-Jiang Zhang, “Efficient video text recognition using multiple frame integration”, IEEE International Conference on Image Processing, Vol. 2, 2002.

11. Xiaojun Li, Weiqiang Wang, Shuqiang Jiang, Qingming Huang and Wen Gao, “Fast and effective text detection”, 15th IEEE International Conference on Image Processing, pp 969-972, 2008.



AUTHOR'S DETAILS


Ms. Megha Khandelwal received her B.Tech degree from Uttar Pradesh Technical University. She is currently pursuing her M.Tech in Computer Science and Engineering at the Centre for Development of Advanced Computing (CDAC), Noida. Her areas of interest are image processing, algorithms and database management systems.


Mr. Bhupendra Kumar (Senior Technical Officer) joined CDAC in 2005. He received his M.Tech degree from IIIT Allahabad with a specialization in wireless communication and computing. His areas of interest are advanced image processing, pattern recognition, computer networks, wireless networks and MANETs. He is currently involved in the project “Development of Robust Document Image Understanding System for Documents in Indian Scripts”.


Mr. Tushar Patnaik (Sr. Lecturer/Sr. Project Engineer) joined CDAC in 1998. He has eleven years of teaching experience. His areas of interest are computer graphics, multimedia, database management systems and pattern recognition. At present he is leading the consortium based project “Development of Robust Document Image Understanding System for Documents in Indian Scripts”.