International Journal on Recent and Innovation Trends in Computing and Communication
Volume: 4, Issue: 3, ISSN: 2321-8169, pp. 511-516
Text Extraction from Captured Image and Conversion to Audio for Smart Phone Application

Sneha V. Deshmukh¹, Jaishree M. Kamble¹, Abhilasha B. Kharate¹, Pooja G. Deo¹, Supriya A. Khadasne¹, Prof. V. B. Gadicha²

¹ Students, Computer Science & Engineering Department, P.R. Pote COE, Amravati University, India
² HOD, Computer Science & Engineering Department, P.R. Pote COE, Amravati University, India

Email: snehadeshmukh33@gmail.com, jayashrikamble400@gmail.com, abhilashakharate@gmail.com, poojadeo18@gmail.com, supriya.khadasne@gmail.com, headcse1108@gamil.com
Abstract: Text extraction from an image captured by a smart phone is a difficult task because of cluttered backgrounds and non-textual regions. In addition, the text appears in a variety of fonts, styles and sizes, and each word may contain different characters, resulting in large variations of text patterns. Even if the problems of cluttered backgrounds and text separation are ignored for the moment, several other factors remain: font style and size variations from word to word or character to character, background and foreground colour, camera position (which can introduce distortions), brightness and image resolution. In the proposed technique, a colour image is first captured with the mobile camera. The colour image is converted into a gray scale image, and the gray scale image is then converted into a binary image. This binary image is given to the Optical Character Recognition (OCR) engine, which recognizes and extracts the text from the image and passes it to the Text to Speech (TTS) engine. The Text to Speech engine converts the text into audio.

Keywords: Text Extraction, OCR engine, Text to Speech engine, Smart Phone Application

I. INTRODUCTION
Extracting text from captured images or videos is an important problem in many applications such as document processing, image indexing, video content summarization, video retrieval and video understanding. In captured images and videos, text characters and strings usually appear on nearby sign boards and hand-held objects and provide significant knowledge about the surrounding environment and objects. Captured images usually suffer from low resolution, low quality, perspective distortion and complex backgrounds [1].

Text in captured images is hard to detect, extract and recognize since it can appear with any slant or tilt, in any lighting, upon any surface, and may be partially occluded. Many approaches for text detection in captured images have been proposed recently. To extract text information from a captured image on a mobile device, an automatic and efficient scene text detection algorithm is essential. A character descriptor has been proposed to extract representative and discriminative features from character patches; it combines several feature detectors with Optical Character Recognition (OCR) [1]. The main focus of our project is that a visually challenged person can obtain, in audio form, the text information present on text boards, instructions on traffic sign boards and hoardings. With this point of view, the application is designed as a camera based reading system that extracts text from a text board, identifies the text characters and strings in the captured image and, finally, converts the text into audio. This system allows the user to photograph a text image, such as a Stop sign board on the road side, click on the speak button and hear the text aloud, giving real time feedback.

Extracting text information from a captured image is a difficult task because of cluttered backgrounds and non-text outliers; furthermore, text consists of dissimilar words where every word may contain different characters in a variety of fonts, styles and sizes, resulting in large intra-variations of text patterns. Even if the problems of cluttered backgrounds and text segmentation were to be ignored for the moment, there are several other factors such as font style and thickness; background as well as foreground colour and texture; camera position, which can introduce geometric distortions; illumination; and image resolution. Optical character recognition is the electronic conversion of photographed images of printed text into computer-readable text. A text-to-speech system converts normal language text into speech. It is usually meant to help visually challenged people, among others [2].
II. LITERATURE SURVEY
Several methods can be used for extracting text from images such as document images and scene images. The text present in an image carries useful and important information. One surveyed approach employs the discrete wavelet transform (DWT) to extract the text from an image. The input image can be a colour image or a gray-scale image; if it is a colour image, preprocessing is first applied. To extract the text edges from the image, a Sobel edge detector is then used on each sub-band image.
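To make the surveyed DWT step concrete, the following sketch computes one level of a 2-D Haar wavelet decomposition of a grayscale image; text strokes produce strong responses in the detail sub-bands, which an edge detector can then sharpen. This is a generic illustration written for this article, not the exact code of the surveyed work, and the array layout and scaling convention are assumptions.

```java
/** One level of a 2-D Haar DWT on a grayscale image (values 0-255, even width/height). */
public final class HaarDwt {

    /** Approximation (ll) and detail (lh, hl, hh) sub-bands, each half the original size. */
    public static final class SubBands {
        public final double[][] ll, lh, hl, hh;
        SubBands(double[][] ll, double[][] lh, double[][] hl, double[][] hh) {
            this.ll = ll; this.lh = lh; this.hl = hl; this.hh = hh;
        }
    }

    public static SubBands decompose(double[][] gray) {
        int h = gray.length / 2, w = gray[0].length / 2;
        double[][] ll = new double[h][w], lh = new double[h][w],
                   hl = new double[h][w], hh = new double[h][w];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w; j++) {
                double a = gray[2 * i][2 * j];         // top-left of the 2x2 block
                double b = gray[2 * i][2 * j + 1];     // top-right
                double c = gray[2 * i + 1][2 * j];     // bottom-left
                double d = gray[2 * i + 1][2 * j + 1]; // bottom-right
                ll[i][j] = (a + b + c + d) / 4.0;      // low-pass approximation
                lh[i][j] = (a + b - c - d) / 4.0;      // row difference: responds to horizontal edges
                hl[i][j] = (a - b + c - d) / 4.0;      // column difference: responds to vertical edges
                hh[i][j] = (a - b - c + d) / 4.0;      // diagonal detail
            }
        }
        return new SubBands(ll, lh, hl, hh);
    }
}
```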
Fig 1: Working Of OCR [3]


1. Text Detection:
In this stage, since there is no prior information on whether
or not the input image contains any text, the existence or nonexistence of text in the image must be determined. Several
approaches assume that certain types of video frame or image
contain text. This is a common assumption for scanned images.
However, in the case of video, the number of frames containing
text is much smaller than the number of frames without text.
The mobile phone camera will capture images formatted as
RGB8888. Leptonica will be used to convert the RGB8888
formatted images into 8-bit luma formatted images and store
them into C data structures. This pre-processing step will help
prepare the captured images from the mobile device for text
analysis by Tesseract [4].
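As an illustration of this preprocessing step, the sketch below converts a captured Android Bitmap into a native 8-bit Pix via the Leptonica JNI bindings. It assumes the tess-two wrapper (com.googlecode.leptonica.android) with its ReadFile and Convert helpers, which may differ from the exact bindings used in the paper.

```java
import android.graphics.Bitmap;
import com.googlecode.leptonica.android.Convert;
import com.googlecode.leptonica.android.Pix;
import com.googlecode.leptonica.android.ReadFile;

public final class Preprocess {
    /**
     * Wraps the captured RGB bitmap in a native Leptonica Pix and reduces it
     * to 8-bit luma (grayscale) so Tesseract can analyze it.
     */
    public static Pix toLuma(Bitmap captured) {
        Pix rgb = ReadFile.readBitmap(captured); // copy pixels into a native C data structure
        Pix luma = Convert.convertTo8(rgb);      // 32 bpp RGB -> 8 bpp grayscale
        rgb.recycle();                           // release the native memory of the RGB copy
        return luma;
    }
}
```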
2. Text Extraction:
This stage detects and extracts the text from camera captured natural scene images. Text finding is used to obtain the image area containing text, and text identification then transforms the image-based information into readable text. This step classifies the characters as they appear in the original image [5]. It is done by multiplying the resulting figure with the binarized original image; the final result is white text on a black background, depending on the original image. Text characters consist of strokes with constant or variable orientation as their basic structure [6]. A new type of feature, stroke orientation, has been proposed to describe the local structure of text characters. From pixel based analysis, the stroke direction is perpendicular to the gradient orientations at the pixels of stroke boundaries. To model the text structure by stroke orientations, a new operator maps a gradient feature of strokes to each pixel; it extends the local structure of a stroke edge into its neighborhood by the gradient of orientations [6].
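The stroke-orientation idea can be illustrated with a small gradient computation: at each pixel the Sobel gradient gives an orientation, and along the two sides of a stroke those orientations point in roughly opposite directions. The sketch below is a generic illustration under that assumption, not the descriptor of [6]; class and method names are made up for this example.

```java
/** Per-pixel gradient orientation map using 3x3 Sobel filters (interior pixels only). */
public final class GradientOrientation {

    public static double[][] orientations(double[][] gray) {
        int h = gray.length, w = gray[0].length;
        double[][] theta = new double[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Sobel responses in the x and y directions
                double gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1])
                          - (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]);
                double gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1])
                          - (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]);
                // Gradient orientation; the stroke direction is perpendicular to this angle.
                theta[y][x] = Math.atan2(gy, gx);
            }
        }
        return theta;
    }
}
```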
3. Gray Scale:
A grayscale image is an image in which each pixel value is a single sample that carries only intensity information. Such images are often called black and white images. The pixel intensity is expressed within a minimum and maximum range. Grayscale images are also called monochromatic, reflecting the presence of only one (mono) colour (chrome).
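A common way to obtain such a grayscale image from a colour image is the weighted luma formula Y ≈ 0.299 R + 0.587 G + 0.114 B. The sketch below applies it directly to an array of packed ARGB pixels (such as those returned by Android's Bitmap.getPixels); the class and method names are illustrative.

```java
/** Converts packed 0xAARRGGBB pixels to 8-bit grayscale using the Rec. 601 luma weights. */
public final class Grayscale {

    public static int[] toGray(int[] argbPixels) {
        int[] gray = new int[argbPixels.length];
        for (int i = 0; i < argbPixels.length; i++) {
            int p = argbPixels[i];
            int r = (p >> 16) & 0xFF;
            int g = (p >> 8) & 0xFF;
            int b = p & 0xFF;
            // Weighted sum approximating perceived brightness
            gray[i] = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
        }
        return gray;
    }
}
```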
4. Thresholding:
Thresholding is the simplest method of image segmentation. It can be used to create binary images from a grayscale image. In the thresholding process, individual pixels in an image are marked as object pixels if their value is greater than some threshold value (assuming the object is brighter than the background) and as background pixels otherwise. This convention is known as threshold above. Variants include threshold below, which is the opposite of threshold above; threshold inside, where a pixel is labeled object if its value lies between two thresholds; and threshold outside, which is the opposite of threshold inside [7]. Typically, an object pixel is given a value of 1 while a background pixel is given a value of 0. Finally, a binary image is created by coloring each pixel white or black, depending on the pixel's label.
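Since the implementation section later mentions the Otsu algorithm, the following sketch shows how a global threshold can be chosen automatically from the grayscale histogram and then applied to produce the binary image (object pixels 1, background 0). This is a textbook version written for illustration, not the application's exact code.

```java
/** Global Otsu thresholding of an 8-bit grayscale image given as values 0..255. */
public final class OtsuThreshold {

    /** Returns the threshold that maximizes the between-class variance. */
    public static int computeThreshold(int[] gray) {
        int[] hist = new int[256];
        for (int v : gray) hist[v]++;

        long total = gray.length;
        double sumAll = 0;
        for (int t = 0; t < 256; t++) sumAll += (double) t * hist[t];

        double sumBg = 0, best = -1;
        long weightBg = 0;
        int threshold = 0;
        for (int t = 0; t < 256; t++) {
            weightBg += hist[t];                    // pixels at or below t (background class)
            if (weightBg == 0) continue;
            long weightFg = total - weightBg;       // pixels above t (object class)
            if (weightFg == 0) break;
            sumBg += (double) t * hist[t];
            double meanBg = sumBg / weightBg;
            double meanFg = (sumAll - sumBg) / weightFg;
            double betweenVar = (double) weightBg * weightFg * (meanBg - meanFg) * (meanBg - meanFg);
            if (betweenVar > best) { best = betweenVar; threshold = t; }
        }
        return threshold;
    }

    /** Applies the threshold: 1 = object (brighter than threshold), 0 = background. */
    public static int[] binarize(int[] gray, int threshold) {
        int[] bin = new int[gray.length];
        for (int i = 0; i < gray.length; i++) bin[i] = gray[i] > threshold ? 1 : 0;
        return bin;
    }
}
```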
5. Text to Speech:
The mobile speaker informs the user of the extracted text in the form of speech or audio; the mobile phone speaker is employed for speech output. Text extraction is performed by the OCR prior to cropping useful terms from the extracted text areas [8]. A text area marks the minimum rectangular region around the lettering within it, so the margin of the text area touches the edge boundary of the text. The system performs better if text regions are first given proper margin areas and binarized to segment the text characters from the background [8]. Thus each detected text region is enlarged by increasing its height and width by 10 pixels each. Both open and closed source solutions exist that provide APIs for the final stage of conversion to letter codes.
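A minimal sketch of the margin step described above: each detected text rectangle is grown by 10 pixels in height and width (clamped to the image bounds) before binarization and OCR. The rectangle representation and method name are assumptions made for illustration.

```java
/** Expands a detected text region by a fixed margin, clamped to the image size. */
public final class TextRegion {
    public int left, top, right, bottom; // right and bottom are exclusive

    /** Grows total width and height by `margin` pixels, i.e. margin/2 on each side. */
    public void pad(int margin, int imageWidth, int imageHeight) {
        left = Math.max(0, left - margin / 2);
        top = Math.max(0, top - margin / 2);
        right = Math.min(imageWidth, right + margin / 2);
        bottom = Math.min(imageHeight, bottom + margin / 2);
    }
}
```

For the 10-pixel enlargement mentioned above, one would call, for example, region.pad(10, imageWidth, imageHeight).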
6. Summary and Discussion:
Experiments were performed to test the text detection and recognition system, which was developed as an application on an Android mobile phone. The test data used in the experiments consists of pictures of text boards taken in the surroundings, covering scenes of different text boards. All the test images were taken with the five megapixel camera embedded in a Samsung Galaxy GT-I8262. We photographed a set of images under different distance and lighting conditions and evaluated the performance of the system. All of the text strings in the captured images are horizontal and of the same size and font. We captured the images from four different distances (4 feet, 6 feet, 8 feet and 10 feet) and observed the results. For this we took 10 test images, performed the experiment 70 times on each image and calculated the percentage of correct recognition for each text image. The percentage of correct recognition is calculated as the ratio between the number of correct recognitions and the number of experiments (for example, 63 correct recognitions out of 70 trials gives 90%) [9].

Fig 2: Distance Vs Average Correct Recognition Rate [9]


Fig 3: Illumination Vs Average Correct Recognition Rate [9]


III. PROPOSED WORK
The problem addressed by this system is that extracting text information from a captured image is a difficult task because of cluttered backgrounds and non-text outliers; furthermore, text consists of dissimilar words where every word may contain different characters in a variety of fonts, styles and sizes, resulting in large intra-variations of text patterns. Even if the problems of cluttered backgrounds and text segmentation were to be ignored for the moment, there are several other factors such as: font style and thickness; background as well as foreground colour and texture; camera position, which can introduce geometric distortions; and image resolution.
1. Proposed Design:
The proposed design for text extraction from captured images and conversion to audio for a smart phone application is shown in Fig. 4. The proposed system design consists of six stages, as follows (a minimal end-to-end sketch of these stages is given below Fig. 4):
i. Capture an image:
In this stage, the image from which text is to be extracted is captured. This is a colour image, which is given as input to the next stage of the system.
ii. Convert colour image to gray scale image:
The second stage takes the colour image as input and converts it into a gray scale image. The gray scale image helps efficient text identification.
iii. Convert gray scale image to binary image:
In this stage, the gray scale image output by the second stage is taken as input and converted into a binary image (digital data). The output binary image is given to the next stage.
iv. Text recognition:
In this stage, text recognition is performed using the OCR (Optical Character Recognition) engine to recognize the text in the image.
v. Text extraction:
The OCR engine recognizes the text areas in the image and passes them to the text extraction stage, where the recognized text is extracted from the image.
vi. Text to Audio conversion:
Once the text is extracted from the image, it is given to this stage, where the text is converted into audio in the English (U.S.) language.

Fig 4: Proposed System
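The six stages can be summarized in one hedged Android sketch: the capture gives a Bitmap, which is grayscaled and binarized, passed to the OCR engine, and the extracted text is finally spoken. It assumes the tess-two TessBaseAPI wrapper and Android's TextToSpeech; names such as tessDataPath are placeholders, not values from the paper.

```java
import android.graphics.Bitmap;
import android.speech.tts.TextToSpeech;
import com.googlecode.tesseract.android.TessBaseAPI;

import java.util.Locale;

public final class ReadAloudPipeline {

    /** Stages i-vi: colour capture -> gray -> binary -> OCR -> extracted text -> audio. */
    public static void process(Bitmap captured, TextToSpeech tts, String tessDataPath) {
        // Stages ii-iii (gray and binary conversion) are handled internally by
        // Tesseract/Leptonica in this path; they are shown explicitly elsewhere in the paper.
        TessBaseAPI ocr = new TessBaseAPI();
        ocr.init(tessDataPath, "eng");       // English trained data; the path is a placeholder
        ocr.setImage(captured);              // stage i input
        String text = ocr.getUTF8Text();     // stages iv-v: recognize and extract the text
        ocr.end();

        tts.setLanguage(Locale.US);          // stage vi: speak the extracted text
        tts.speak(text, TextToSpeech.QUEUE_FLUSH, null);
    }
}
```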


2. System Design:
The detailed flow design of text extraction from a captured image and conversion to audio for a smart phone application is shown in Fig. 5. Five steps are required to execute the system effectively. From the end user's point of view, execution starts from the XML files, which provide the user interface, but the actual execution of the system starts from the Java Native Interface and proceeds upward.
i. Java Native Interface:
The Java Native Interface (JNI) is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications and libraries written in other languages such as C, C++ and assembly. In our system, the Java Native Interface layer consists of C++ programming files that support the OCR engine.

Fig 5: Detailed flow diagram of the system


ii. Leptonica Library:
The Leptonica library is accessed from the Java classes through the Java Native Interface (the C++ native files). The Leptonica library provides the following features:
Binary morphology:
This is a source for efficient implementations of binary morphology. This code is automatically generated. Binary morphology is implemented in two ways:
i. Successive full-image rasterops for arbitrary structuring elements (Sels)
ii. Destination word accumulation (dwa) for specific Sels.
Gray scale morphology:
It gives an efficient implementation of grayscale morphology
for brick Sels. Brick Sels are separable into linear horizontal and
vertical elements. They use the van Herk/Gil-Werman algorithm
that performs the calculations in a time that is independent of the
size of the Sels. We also provide grayscale rank order filters for
brick filters. The rank order filter is a generalization of grayscale
morphology, which selects the rank-valued pixel (rather than the
min or max). A colour rank order filter applies the grayscale rank
operation independently to each of the (r, g, b) components.
Image scaling:
Leptonica provides many simple and relatively efficient
implementations of image scaling. Grayscale and color images
are scaled using:
a) Sampling
b) Lowpass filtering followed by sampling
c) Area mapping
d) Linear interpolation
Image shear and rotation:
Image shear is implemented with both rasterops and linear
interpolation. The rasterop implementation is faster and has no
constraints on image depth. The interpolated shear is used on 8
bpp and 32 bpp images, and gives a smoother result. Shear is
used for the fastest implementations of rotation.
There are three different types of general image rotators:
a) Grayscale rotation using area mapping
b) Rotation of an image of arbitrary bit depth, using either 2 or 3 shears
c) Rotation by sampling.
Sequential algorithms:
It provides a number of fast sequential algorithms, including
binary and grayscale seedfill, and the distance function for a
binary image. The most efficient binary seedfill is pixSeedfill(),
which uses Luc Vincent's algorithm to iterate raster- and
antiraster-ordered propagation, and can be used for either 4- or 8-connected fills. Similar raster/antiraster sequential algorithms are
used to generate a distance map from a binary image, and for
grayscale seedfill. We also use Heckbert's stack-based filling
algorithm for identifying 4- and 8-connected components in a
binary image. A fast implementation of the watershed transform,
using a priority queue, is included.
Image enhancement:
A few simple image enhancement routines for grayscale and
color images have been provided. These include intensity
mapping with gamma correction and contrast enhancement, as
well as edge sharpening, smoothing, and hue and saturation
modification.

Image I/O:
Some facilities have been provided for image input and
output. This is of course required to build executables that handle
images, and many examples of such programs, most of which are
for testing, can be built in the prog directory. Functions have
been provided to allow reading and writing of files in JPEG,
PNG, TIFF, BMP, PNM, GIF, WEBP and JP2 formats.
iii. Tesseract API:
Tesseract is written in the C++ programming language, so it is no trivial task to use it on the Java-based Android OS. The C++ code needs to be wrapped in a Java class and run natively via the Java Native Interface (JNI). Though there is some effort involved, one great benefit of running Tesseract natively is that C++ is substantially faster than Java. Tesseract-OCR uses liblept mainly for image I/O, but any of Leptonica's many image processing functions can also be used on a PIX while calling TessBaseAPI methods.
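A minimal sketch of that interplay, assuming the tess-two helper classes ReadFile, Binarize and TessBaseAPI (other bindings may expose slightly different signatures): the captured Bitmap is turned into a native PIX, processed by a Leptonica function, and the same PIX is handed to Tesseract.

```java
import android.graphics.Bitmap;
import com.googlecode.leptonica.android.Binarize;
import com.googlecode.leptonica.android.Pix;
import com.googlecode.leptonica.android.ReadFile;
import com.googlecode.tesseract.android.TessBaseAPI;

public final class OcrHelper {

    /** Runs Leptonica preprocessing on a Pix, then hands the processed Pix to Tesseract. */
    public static String recognize(Bitmap captured, String tessDataPath) {
        Pix pix = ReadFile.readBitmap(captured);          // Bitmap -> native PIX
        Pix binary = Binarize.otsuAdaptiveThreshold(pix); // Leptonica image processing on the PIX
        pix.recycle();

        TessBaseAPI api = new TessBaseAPI();
        api.init(tessDataPath, "eng");                    // path to the tessdata directory (placeholder)
        api.setImage(binary);                             // TessBaseAPI method on the processed PIX
        String text = api.getUTF8Text();
        api.end();
        binary.recycle();
        return text;
    }
}
```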

Architecture of Tesseract:
Tesseract works with an independently developed page layout analysis technology; hence Tesseract accepts the input image as a binary image. Tesseract can handle both traditional black-on-white text and the reverse. The outlines of components are stored during connected component analysis. Nesting of outlines gathers the outlines together to form a blob, and such blobs are organized into text lines. Text lines are analyzed for fixed pitch and proportional text, and the lines are then broken into words according to the character spacing: fixed-pitch text is chopped into character cells, while proportional text is broken into words using definite spaces and fuzzy spaces. Tesseract then recognizes the words in two passes. The first pass tries to recognize each word; every satisfactory word is passed to an adaptive classifier as training data, which recognizes text more accurately. During the second pass, the words that were not recognized well in the first pass are recognized again in a run over the page. Finally, Tesseract resolves fuzzy spaces and, to distinguish small and capital text, checks alternative hypotheses for the x-height.

Fig 6: Text Extraction from image using OCR engine


Text To Speech Engine:
A text to speech (TTS) synthesizer is a system that can automatically read text aloud; here the text is the output extracted by Optical Character Recognition (OCR). A speech synthesizer can be implemented in both hardware and software. Speech synthesis is the artificial production of human speech [10]. A text-to-speech (TTS) system converts normal language text into speech. A synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output [11].
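As a concrete illustration, the sketch below speaks the extracted text with Android's built-in TextToSpeech engine; the class and field names are illustrative, and the engine initializes asynchronously, so the sketch waits for onInit before speaking.

```java
import android.app.Activity;
import android.speech.tts.TextToSpeech;

import java.util.Locale;

/** Speaks the OCR result through the phone speaker using Android's built-in TTS engine. */
public class SpeakOutHelper implements TextToSpeech.OnInitListener {

    private final TextToSpeech tts;
    private boolean ready = false;

    public SpeakOutHelper(Activity activity) {
        tts = new TextToSpeech(activity, this); // the engine initializes asynchronously
    }

    @Override
    public void onInit(int status) {
        if (status == TextToSpeech.SUCCESS) {
            tts.setLanguage(Locale.US);         // English (U.S.) output, as in the proposed system
            ready = true;
        }
    }

    /** Call from the speak-out button handler with the text extracted by the OCR engine. */
    public void speak(String extractedText) {
        if (ready) {
            tts.speak(extractedText, TextToSpeech.QUEUE_FLUSH, null);
        }
    }

    public void shutdown() {
        tts.shutdown();                         // release the engine when the Activity is destroyed
    }
}
```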

Fig 7: Text to Speech Conversion


a) Text Analysis & Detection:
Text analysis is part of preprocessing. It analyzes the input text and organizes it into a manageable list of words, which are then transformed into full text.

b) Text Normalization & Linearization:
Text normalization is the transformation of text into a pronounceable form. The main objective of this process is to identify punctuation marks and pauses between words. Usually the text normalization process converts all letters to lower or upper case and removes punctuation, stop words or overly common words.
c) Phonetic Analysis:
It provides the phonetic alphabet. Grapheme-to-phoneme conversion is performed, which is actually a conversion of orthographic symbols into phonological symbols.

d) Acoustic Processing:
It performs formant synthesis. It works intelligently and thus does not require any database of speech samples. To speak out the text, it uses the voice characteristics of a person.
IV. SYSTEM IMPLEMENTATION
System implementation is one of the important stages of software design. In the system implementation we actually implement the optical character recognition (OCR) algorithm and the Otsu algorithm for text recognition and extraction. The actual implementation of this application is supported by the Leptonica library, an open source library used by Google's Tesseract; Tesseract uses a combination of many different algorithms to produce digital text from the images provided by Leptonica.
1. System Execution Details:
The user must use an Android mobile phone and capture the desired image. The OCR will ignore the non-textual region of the picture and will print only the text. The user also has to follow the required steps in order to avoid any error while using the application. The application works through the following steps:
a) Click on the capture button:

Fig 8: Image capturing

To execute or run this application, the first step is to click on the image capture button. As shown in the figure, this GUI has two image buttons: one to capture an image and a second, the speak-out button. Using the first button we can take an image of our surroundings.
b) Capture an image using the smart phone's camera:
Fig 9: Captured image
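A minimal sketch of how step (b) could be implemented with the standard Android camera intent; the request code, handler names and the comment on where the Bitmap goes are illustrative assumptions, not taken from the paper.

```java
import android.app.Activity;
import android.content.Intent;
import android.graphics.Bitmap;
import android.provider.MediaStore;

/** Launches the camera app and receives a small preview Bitmap of the captured image. */
public class CaptureActivity extends Activity {

    private static final int REQUEST_IMAGE_CAPTURE = 1; // illustrative request code

    /** Wire this to the capture-image button's onClick handler. */
    public void onCaptureClicked() {
        Intent takePicture = new Intent(MediaStore.ACTION_IMAGE_CAPTURE);
        if (takePicture.resolveActivity(getPackageManager()) != null) {
            startActivityForResult(takePicture, REQUEST_IMAGE_CAPTURE);
        }
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == REQUEST_IMAGE_CAPTURE && resultCode == RESULT_OK && data != null) {
            Bitmap captured = (Bitmap) data.getExtras().get("data"); // thumbnail-sized preview
            // Hand the Bitmap to the OCR pipeline (grayscale -> binary -> Tesseract -> TTS).
        }
    }
}
```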


c) Crop an image as per our requirement and save it:

Fig 10: Cropping and saving of an image

d) Extract the text from the captured image; it will be shown in the text box:

Fig 11: Extraction of text from captured image

e) Click on speak-out button:

Fig 12: Text to audio conversion

As mentioned in the figure, to convert the extracted text into audio form we have to click on the speak-out button. Hence we get the output "Quantitative Aptitude" in audio form.
VI. CONCLUSION AND FUTURE SCOPE

We developed an application for text recognition and extraction from a captured image and its conversion to audio using the Android platform. Mobile phones are handy and more suitable for use anywhere than personal computers. The Android mobile camera captures an image and the application extracts the text from it: it identifies text areas in the captured image, recognizes the text and then converts the extracted text to speech in the English (U.S.) language. It helps visually challenged people and is also useful in many official tasks such as business card readers, vehicle number plate detection, etc.

In future we will try to improve the accuracy rate of text detection and text recognition by improving the implementation. We also plan to enhance this application to support multi-language text extraction and audio conversion.

VII. REFERENCES
[1] M. Prabaharan and K. Radha, "Text Extraction from Natural Scene Images and Conversion to Audio in Smart Phone Applications," doi: 10.15680/ijircce.2015.0301004.
[2] Rupali D. Dharmale and P. V. Ingole, "Text Detection and Recognition with Speech Output for Visually Challenged Person," IJAIEM, ISSN 2319-4847, Volume 5, Issue 1, January 2016.
[3] Kalyani Mangale and Hemangi Mhaske, International Journal of Computer Applications (0975-8887), National Conference on Emerging Trends in Advanced Communication Technologies (NCETACT-2015).
[4] Mohit K. Chauhan et al., (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014, pp. 8285-8292.
[5] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. CVPR, pp. 2963-2970, 2010.
[6] A. Coates et al., "Text detection and character recognition in scene images with unsupervised feature learning," in Proc. ICDAR, pp. 440-445, 2011.
[7] Rafael C. Gonzalez, Digital Image Processing, Second Edition, Pearson Education, Upper Saddle River, New Jersey, USA, 2002.
[8] Jing Zhang, "Extraction of Text Objects in Image and Video Documents," University of South Florida.
[9] Chucai Yi and Yingli Tian, "Scene Text Recognition in Mobile Applications by Character Descriptor and Structure Configuration," IEEE Transactions on Image Processing, Vol. 23, No. 7, July 2014.
[10] Ray Smith, "An Overview of the Tesseract OCR Engine," Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), 2007, pp. 629-633.
[11] Alberto Martin and Sabri Tosunoglu, "Image Processing Techniques for Machine Vision," Florida International University.
