Bachelor of Engineering
Information Technology
Thesis
26 May 2014
Abstract
The aim of the project was to build a flashcard Android application to help people
memorize words. The major functions of the application include creating card books,
adding words to or removing words from a created book by typing or by optical
character recognition (OCR), and translating all the words automatically.
The application was developed on the Android platform with the Android SDK and the
Tesseract OCR engine. An Android word-storing application and an image scanning
application were combined to implement the project. Because Tesseract is a native
library, the Android NDK and the Java Native Interface (JNI) were needed to call its
functions. The Microsoft translation API was used to help users translate all the
entered words to the expected language.
The results of the project showed that using OCR technology to digitize paper-based
documents was a feasible way to simplify user input, and that it made archives
searchable, editable and machine readable. As the imaging quality of cameras
increases, OCR technology will produce more accurate and reliable results.
1 Introduction
5 Project Evaluation
5.1 Application Testing and Result
5.2 Further development and possible improvements
6 Conclusions
References
1 Introduction
This final year project is a mobile word-collection solution built out of personal
interest. It is a trial and practice of my development skills in the open source
community. During the past few years, the trend of digitizing paper-based material has
emerged. Data and information would be much more valuable if they were available in
digital form. Digitization technology has been applied in many fields in people's
daily lives, such as retail businesses, post offices, insurance and aircraft
companies [3]. Optical character recognition (OCR) is one of the most commonly used
digitization technologies.
The OCR concept was first invented to help the blind to read [1]. Nowadays OCR
software engines have been developed for various applications, such as receipt,
invoice, check and legal billing processing [3]. Optical character recognition helps
to reduce the manual handling of data by people.
In the final year project, I integrated the Tesseract OCR engine into the application,
as it is open-source and free to use, released under the Apache License. The Tesseract
OCR engine can read a wide variety of image formats and convert them to text in over
60 languages [1]. The Tesseract OCR engine was initially developed at HP Labs and
has been extensively improved by Google [1].
Since there are very few OCR solutions for flash card applications on the market, I
chose this technology as the topic for my final year project. Flash cards are a group
of cards with hint information on one side and the answer on the other side. The
thesis describes the development of the flash card Android application and discusses
the theory of OCR technology, the process of software design and development, and
the result of the project.
Most people start to learn reading and writing during the first years of education. By
the time they have finished basic education, people have usually acquired writing
and reading skills. Even when faced with fancy font styles, misspelled words,
fragmented parts and figurative or artistic designs, most people are able to read and
understand the contents by making use of experience and context. By contrast,
compared with human reading skills, there is still a long way to go for machines to
produce human-competitive text recognition.
Figure 1. Early OCR Reading Machine Copied from Cheriet (2007) [2]
With the invention of the digital-integrated circuit, the scanning speed and conversion
speed were highly accelerated [2].
In the early 1960s, various errors occurred in OCR in cases of poor print quality,
wide variations in fonts and rough paper surfaces [2]. A leap happened in the 1970s
for OCR applications. The American National Standards Institute (ANSI) and the
European Computer Manufacturers Association (ECMA) designed fonts custom-built for
OCR, such as OCR-A and OCR-B [2]. The International Organization for Standardization
(ISO) soon adopted the customized fonts [2]. Consequently, higher recognition accuracy
was achieved. With all these accomplishments, the cost of high-speed, accurate OCR
scanning was sharply lowered.
Currently, in the OCR, it takes three primary steps for a captured image to be
recognized, as shown in Figure 2: preprocessing stage, feature extraction stage and
classification stage [2].
2.2.1 Preprocessing
After a raw image is taken, image preprocessing is the first recognition stage. It is
responsible for eliminating irrelevant data, locating the regions of interest, and
enhancing the images so that the data can be used efficiently by the next step [2].
The feature extraction stage is used to extract the most relevant information from the
text image, which helps to recognize the characters in the text [3]. The algorithms
and techniques applied in this stage for selecting representative features are the
core of the whole recognition process and greatly affect the recognition quality. The
most popular feature extraction methods can be categorized into geometric features
(moments, histograms, and direction features), structural features (registration, line
element features, Fourier descriptors and topological features) and feature space
transformation methods (principal component analysis (PCA), linear discriminant
analysis, kernel PCA) [2].
The classification stage utilizes the features that were extracted in the previous step. It
determines the feature space in which an unknown pattern falls. The classification step
is performed by comparing feature vectors corresponding to the input character with
the representative of each character class [4].
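The comparison described above can be sketched as a nearest-prototype classifier. This is only a minimal illustration under stated assumptions: the class name, the method names, the two-element feature vectors and the use of squared Euclidean distance are hypothetical choices, not the classifier of any particular OCR engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative nearest-prototype classifier; all names and values are hypothetical.
class NearestPrototypeClassifier {
    // One representative feature vector per character class.
    private final Map<Character, double[]> prototypes = new LinkedHashMap<>();

    void addPrototype(char label, double[] features) {
        prototypes.put(label, features.clone());
    }

    // Squared Euclidean distance between two feature vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the label whose representative is closest to the input vector.
    char classify(double[] features) {
        if (prototypes.isEmpty()) {
            throw new IllegalStateException("no prototypes registered");
        }
        char best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Character, double[]> entry : prototypes.entrySet()) {
            double d = squaredDistance(features, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In practice the feature vectors would come from the feature extraction stage, and a real engine would likely use more elaborate distance measures and several representatives per class.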
OCR has been developed extensively for over 60 years and has been applied in various
fields [1]. It has saved people a huge amount of manual typing work. With the
flourishing of cloud computing services and smart mobile devices and the release of
OCR engines in the past few years, OCR is not only saving human labor but also
providing a better user experience. OCR is now making its way into our daily lives.
Some examples follow.
In 2012, an OCR feature was added to the Google Translate service, as shown in
Figure 3, so that users could translate text using only the camera.
It gives users a quick way of input when they do not have a suitable keyboard or when
the text is too long.
Figure 4. An Example of ALPR User Interface copied from David (2012) [6]
Figure 5. OCR full-page multi-illumination desktop readers copied from OCR640 full-
page multi-illumination desktop readers for ePassports and eID Cards (2014) [7]
It takes less than one second for passengers and security personnel to finish the
identity check at the airport, as shown in Figure 5 [7]. The whole process of checking
and recording is automated. Since the data is digitized, it can be put to further use,
such as tracking movement records [7].
One of the core difficulties in my final year project was to find an efficient and
economical OCR solution for the application. There are numerous OCR engines and cloud
service providers, such as ABBYY Cloud OCR SDK, Bing OCR and Alternatives, offering
programmers easy approaches to developing OCR applications for many different fields.
However, most of the OCR engines or service providers are commercial and
closed-source [7].
Figure 6. The results of current and old Tesseract copied from Ray (2014)[8]
The Tesseract OCR engine is open-source and the most accurate free-to-use OCR engine.
It was created by HP Labs and has been extensively enhanced by Google (as shown in
Figure 6). In the figure, the recent version is labeled 2.0 and the original 1995
version is labeled HP. At the time this project was started, it was the only free OCR
toolkit available for Android on the market.
Unlike the smearing selection used in Google Translate, Tesseract assumes that its
input is a binary image with optional polygonal text regions defined. The outlines of
the components are gathered into blobs [8]. Blobs are organized into text lines [8].
(In OCR, blobs are areas of the digital image that are detected to differ from the
surrounding regions in color or brightness.) Tesseract detects the text area with its
line finding algorithm.
Tesseract first filters out blobs that are smaller than some fraction of the median
height, which are likely to be punctuation or noise. Then Tesseract processes the
filtered blobs horizontally and assigns them to text lines. After the blobs have been
assigned, the filtered-out blobs are put back into the appropriate positions.
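The height filter in the first step can be illustrated with a short sketch. It is not Tesseract's code: the class name, the representation of blobs as an array of pixel heights and the fraction passed by the caller are assumptions made for illustration, since the text only speaks of "some fraction" of the median height.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a median-height blob filter; not Tesseract's implementation.
class BlobHeightFilter {
    // Median of the heights (mean of the two middle values for an even count).
    static double median(int[] heights) {
        int[] sorted = heights.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Returns the indices of blobs at least minFraction times as tall as the median.
    // Smaller blobs (likely punctuation or noise) are left out and can be
    // re-attached after the remaining blobs have been assigned to text lines.
    static List<Integer> keepTextSized(int[] heights, double minFraction) {
        List<Integer> kept = new ArrayList<>();
        if (heights.length == 0) {
            return kept;
        }
        double threshold = minFraction * median(heights);
        for (int i = 0; i < heights.length; i++) {
            if (heights[i] >= threshold) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

For example, with blob heights of 20, 22, 3, 21, 19 and 4 pixels and an assumed fraction of 0.5, the two small blobs fall below the threshold and are set aside.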
Once the text lines have been detected, the baselines are fitted more precisely using
a quadratic spline. The baselines are fitted by partitioning the blobs into groups with
a reasonably continuous displacement from the original straight baseline [8]. A
quadratic spline is fitted to the most populous partition (assumed to be the baseline)
by a least squares fit [8]. The advantage of the quadratic spline is that the
calculation is reasonably stable. Its disadvantage is that discontinuities can arise
when multiple spline segments are required. As shown in Figure 8, Tesseract uses the
baseline (green), the descender line (blue), the mean line (pink) and the ascender
line (cyan) to locate the text line.
Figure 8. An example of the curved fitted baseline copied from Ray (2014)[8]
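To make the least squares step concrete, the following sketch fits a single quadratic y = a*x^2 + b*x + c to a set of points by solving the normal equations. It is a simplification under stated assumptions: Tesseract fits a spline that may consist of several segments, whereas this sketch fits only one segment, and the class and method names are hypothetical.

```java
// Hypothetical sketch: least squares fit of one quadratic segment to baseline points.
class QuadraticLeastSquares {
    // Returns {a, b, c} minimizing the sum of (a*x^2 + b*x + c - y)^2 over all points.
    static double[] fit(double[] xs, double[] ys) {
        if (xs.length != ys.length || xs.length < 3) {
            throw new IllegalArgumentException("need at least three (x, y) pairs");
        }
        // Accumulate the power sums that make up the normal equations.
        double s0 = xs.length, s1 = 0, s2 = 0, s3 = 0, s4 = 0;
        double t0 = 0, t1 = 0, t2 = 0;
        for (int i = 0; i < xs.length; i++) {
            double x = xs[i];
            double x2 = x * x;
            s1 += x;
            s2 += x2;
            s3 += x2 * x;
            s4 += x2 * x2;
            t0 += ys[i];
            t1 += x * ys[i];
            t2 += x2 * ys[i];
        }
        // Normal equations in matrix form: m * (a, b, c)^T = rhs.
        double[][] m = {{s4, s3, s2}, {s3, s2, s1}, {s2, s1, s0}};
        double[] rhs = {t2, t1, t0};
        double det = det3(m);
        if (Math.abs(det) < 1e-12) {
            throw new ArithmeticException("points do not determine a quadratic");
        }
        // Cramer's rule: replace one column at a time with the right-hand side.
        double[] coefficients = new double[3];
        for (int col = 0; col < 3; col++) {
            double[][] copy = new double[3][];
            for (int row = 0; row < 3; row++) {
                copy[row] = m[row].clone();
                copy[row][col] = rhs[row];
            }
            coefficients[col] = det3(copy) / det;
        }
        return coefficients;
    }

    // Determinant of a 3x3 matrix by cofactor expansion along the first row.
    static double det3(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }
}
```

Here the x values would be horizontal blob positions and the y values the bottoms of the blobs in the chosen partition. Cramer's rule is used only because the system is 3x3; a production implementation would normally prefer a numerically more robust solver.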
After the text area has been detected, Tesseract tests whether the characters in the
words have equal widths (fixed pitch) and segments such words into characters, as
shown in Figure 9.
When fixed pitch text is found, the chopper is disabled on these words for the word
recognition step [8].
Tesseract improves an unsatisfactory result by chopping the blob. It attempts to find
candidate chop points from concave vertices of a polygonal approximation of the
outline [8]. It may take up to three pairs of chop points to successfully separate
joined characters from the ASCII set [8]. Figure 10 shows how Tesseract recognizes the
characters when the character "r" touches the character "m".
If the chops are all used up and the result is still not satisfactory, the analysis
task is taken over by the associator. The associator performs a best-first search of
the graph of possible combinations of the chopped blobs into candidate characters.
Figure 11 shows an example of a broken character.
The features of the broken character are extracted from the outline fragments by the
static classifier.
The many-to-one matching helps Tesseract recognize broken characters easily (as shown
in Figure 12) [9].
To improve the accuracy and extend the range of supported languages, the Tesseract
engine is designed to be fully trainable. After character segmentation, a two-pass
process is performed. In the first pass, an attempt is made to recognize each word in
turn [8]. Each satisfactory result is passed to the dynamic classifier as training
data [8]. In the second pass, words that were not recognized satisfactorily are
recognized again.
Users can train Tesseract to improve the accuracy of recognition for their own cases.
Figure 13 is a digital image waiting for recognition.
Some of the characters do not match the expectation. Figure 14 shows that the result
of recognition is not perfect.
With the sample image, I used the Tesseract shell command to generate a .box file,
which records the recognized characters and their positions. Then I provided the file
as input to jTessBoxEditor. The output is shown in Figure 16. In this step, some of
the character recognition results (the numbers "1" and "5") were not what was
expected.
In this case, I corrected the wrong characters and saved the result. After that I
exported the language-training file and saved it in the Tesseract data directory.
Thus, Tesseract was trained by being told the mistakes it had made so that it would
not make the same mistakes in later recognition.
Figure 18 shows the recognition result after Tesseract was trained. The result perfectly
matches the expectation.
Flash cards are a group of cards, made for study purposes, that contain hint
information on one side and answers on the other side. Flash cards are widely used to
help memorization. In this project, the flash card application acted as a
word-retention tool. There are numerous flashcard applications on the market, but the
motivation of this project was that I could barely find any flashcard application with
an OCR input function. The most similar application I could find was Google Translate,
but it did not fit the use case of flash cards. Since the OCR process of Google
Translate does not contain a word segmentation stage, the expected words cannot be
split into a customized word list. So I decided to develop an Android application that
would both fit the flash card use case and offer OCR technology for input convenience.
Figure 18 illustrates the steps that users follow in the Flash card application use case.
When a user starts the application, they can choose to enter words with the camera,
to type them manually, or to look back at previously recorded words. Once the
word-input step is finished, a new word list is generated. All the words on the list
are translated to the expected language automatically.
Thus, the Finnish words could be regarded as the hint information and the translations
could be regarded as the answers.
This section discusses the primary workflow of the application from the development
perspective. As mentioned, the OCR task utilized the Tesseract OCR engine. After the
words have been captured by the camera, the recognized string is passed to the word
list. In this stage, all punctuation characters and numbers are filtered out. If
multiple words are detected, the application splits them so that each word gets its
own row in the list. Every item of the list has a check box for choosing the word.
When the word list has been confirmed, a new set of flash cards is generated. Every
item of the list can be pulled down and closed up for the purpose of word memorizing.
The translation of each word is placed in the pulled-down container.
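The filtering and splitting step described above can be sketched in a few lines. This is a minimal illustration rather than the application's actual code: the class and method names are hypothetical, and the rule "keep only Unicode letters" is an assumed reading of "punctuation characters and numbers are filtered out".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning a recognized string into rows of a word list.
class WordListBuilder {
    // Splits the OCR output into words, dropping digits and punctuation.
    static List<String> toWords(String recognized) {
        List<String> words = new ArrayList<>();
        if (recognized == null) {
            return words;
        }
        // Replace every run of non-letter characters (digits, punctuation,
        // whitespace, line breaks) with a single space. \p{L} matches any
        // Unicode letter, so characters such as the Finnish "ä" are kept.
        String lettersOnly = recognized.replaceAll("[^\\p{L}]+", " ").trim();
        if (lettersOnly.isEmpty()) {
            return words;
        }
        for (String word : lettersOnly.split(" ")) {
            words.add(word);
        }
        return words;
    }
}
```

One consequence of this simple rule is that hyphenated or apostrophized words are split into several rows, so a real implementation might treat those characters specially.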
Figure 19 describes the basic process as a flowchart. When the words have been scanned
or entered manually, the flash card data is stored in the database inside the
application.
Over the past few years, the boom in smartphones has subtly changed people's daily
habits. When I started the IT program five years ago, this change aroused my curiosity
about the mobile software specialization option in the main curriculum of the degree.
I spent most of my spare time studying Android development. Thus I adopted Android as
the platform for the final year project.
4.1.1 Android
When the project started, there were four mobile operating systems performing
noticeably on the market. As shown in Figure 21, they were Android, BlackBerry OS, iOS
and Microsoft Windows Phone. Among them, Android was dominating the market with a
share of over 80% [7]. One and a half million Android devices are activated every day
in more than 190 countries [11]. Among the four operating systems mentioned above,
Android is the only open-source system. It is built on the Linux kernel, which gives
developers the possibility to perform lower level implementation [11].
Figure 21. The smartphone OS Market Share copied from Android Software Develop-
ment Kit (2014)[11]
The Android stack consists of four major layers (as shown in Figure 21):
- The Application Layer, the topmost layer, interacts with system users.
- The Application Framework Layer provides the APIs for application development.
- The Libraries Layer includes a group of C/C++ libraries, primarily relating to
  graphics, audio processing and data storage.
- The Android Runtime is located in the same layer as the libraries. It includes core
  Java libraries that enable programmers to develop Android applications in the Java
  programming language.
- The Linux Kernel Layer is the core of the system. It acts as a hardware abstraction
  at the bottom of the Android stack.
Such a stack structure design ensures the loose coupling between layers. When an
update happens to a lower layer, the upper application layer will not need to be
changed.
Figure 21. Android System Architecture copied from Android Software Development Kit
(2014)[11]
Since Tesseract is implemented in C++, Android NDK (Native Development Kit) will be
needed to encapsulate the native code as Java API, so that Tesseract functions can be
called from the Android application.
As shown in Figure 22, the rectangle in the middle is the word selection area. The
recognized result is shown in the upper left corner.
At the final step of the application workflow, words need to be translated into the
expected language automatically through the network. Both Google Translate and
Microsoft Bing provide machine translation services. The only reason I chose Bing was
that its translation service was free to use. Microsoft does not provide the service
as a Java library. However, there is a Bing Translator Java wrapper project called
microsoft-translator-java-api hosted on Google Code. The Microsoft translation service
demands an access token, so users need to register for the translation service at
Azure Datamarket (http://datamarket.azure.com).
Since the tess-two project uses JNI (Java Native Interface) to implement the Tesseract
C++ interface, before referencing tess-two, it will be necessary to use the NDK tool to
build the project. Figure 23 shows how to build the tess-two project by using the NDK
command in the shell.
By building the project, the NDK creates the native libraries in the libs/armeabi and
libs/armeabi-v7a directories [12]. The library files are built in the .so format, as
shown in Figure 23.
During Android development, when there is more than one developer in the team, or
when one developer needs to build on others' work, code reuse has to be taken into
consideration. If a commonly used module is simply copied into each team member's
project, any small modification to that module will cause trouble when the team's
projects are combined and synchronized.
There are two ways of implementing code reuse in Android development. One way is to
adopt one project as the main project, export the rest as JAR files, and then add
those external JAR files to the build path. The JAR packaging hides the source code.
In order to use the Bing Translator API, download the
microsoft-translator-java-api-0.6-mod.jar file from the Google Code project and import
the library file by right-clicking the project and then choosing Properties -> Java
Build Path -> Add External JARs, as shown in Figure 25.
The other way is to set the referenced project as a library project. To import the
tess-two project, I adopted the second way. To do that, first right-click the tess-two
project in the Eclipse project explorer and click Properties. Choose the Android tab
and check the Is Library option, as shown in Figure 25.
To reference the tess-two project in the main project, right-click the flash card
project and click Properties, as shown in Figure 26.
The code pattern was designed in a way that Models and Views were strictly isolated.
Each package plays a different role. The relationships are illustrated in the class
diagrams in Figure 27 and Figure 28. Figure 27 shows the classes implemented for the
views at every use case stage.
When users enter the application, a PageActivity containing three fragments is
created.
Figure 28 describes the data structure containing words, word lists and translations of
words and their implementations.
The original intention of building an OCR flash card application was to improve the
user experience when people recite words and interact with the application. Besides a
faster way of input, building a clear user interface with a smooth learning curve was
another critical topic for the project. From the design perspective, many applications
nowadays are overly task-focused and simply pile up features. This section discusses
the user interface design of the flash card application.
The user interface of the flash card application has four primary views: the camera
view, the edit view, the browser view and the book view. When the application starts,
a user can add a set of words by means of OCR (tapping the camera button) or by means
of manual typing (tapping the plus button), as shown in Figure 29. Figure 30 shows how
the application implements the OCR function in the camera view.
In the camera view, the user can run OCR on a text and then see the words in the edit
view. A user who does not want to use OCR can input words manually in the edit view.
Whenever the application is used, the user has to create a container (a list) for a
set of words.
After the strings are captured, they are segmented into words, which are placed in
the rows of a list. The user needs to select the expected words and type a name for
the list of words. The lists can be browsed in the browser view. By having named
lists, the user can gather meaningful sets of words and utilize them later on. When
the user taps a list, its words show up in the book view. When a word is tapped, its
translation is shown in the unfolded item for the purpose of reciting words, as shown
in Figure 31. A flash card is, metaphorically, a card with a word on one side and the
translation of the word on the other side. In the book view, the user can hide the
translation when learning the language.
To implement the user interface, I adopted the Android ViewPager, which was used in
conjunction with three fragments. The FragmentPagerAdapter class, a standard adapter,
was adopted for using fragments with the ViewPager [12]. Compared with building every
view as an Activity, this approach made it convenient to supply the pages and manage
the life cycle of each page [12].
An Activity containing fragments has a direct impact on the life cycle of the
fragments that it owns. Every callback in the Activity life cycle leads to a
corresponding callback in the fragment life cycle. For example, when the Activity
receives the onPause() callback, every fragment contained in the Activity also
receives the onPause() callback. The fragment life cycle has a few more callback
functions than the Activity life cycle, for the purpose of interacting with the
Activity and generating and destroying the fragment UI. Figure 32 shows the
relationship between the Activity life cycle and the Fragment life cycle.
Figure 32. The Activity life cycle directly impacting on the Fragment life cycle
A fragment does not inherit from View. It has its own life cycle, which makes it easy
to manage when its status needs to change. On the other hand, the Activity is
independent of the Fragment life cycle. Using the ViewPager with fragments improves
the reusability and extensibility of the code.
Table 1: ID | Name
Table 2: ID | Book_id | Word
Table 3: ID | Word | Translation
The SQLite database was adopted to store the word lists, words and their translations
in this project. Tables 1, 2 and 3 above illustrate the structure of the saved word
data and the dependencies between the tables.
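The dependency between the three tables can be illustrated with a small in-memory stand-in. This is not the application's SQLite code: the class and method names are hypothetical, the row ID of the translation table is omitted for brevity, and the sketch only shows how a Book_id links words to a list and how a word links to its translation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the three tables; not the real database layer.
class FlashCardTables {
    // One row of the second table: ID, Book_id, Word.
    static final class BookWord {
        final int id;
        final int bookId;
        final String word;

        BookWord(int id, int bookId, String word) {
            this.id = id;
            this.bookId = bookId;
            this.word = word;
        }
    }

    private final Map<Integer, String> books = new LinkedHashMap<>();      // ID -> Name
    private final List<BookWord> bookWords = new ArrayList<>();            // ID, Book_id, Word
    private final Map<String, String> translations = new LinkedHashMap<>(); // Word -> Translation

    // Adds a word list (book) and returns its generated ID.
    int addBook(String name) {
        int id = books.size() + 1;
        books.put(id, name);
        return id;
    }

    // Adds a word to an existing book and stores its translation.
    void addWord(int bookId, String word, String translation) {
        if (!books.containsKey(bookId)) {
            throw new IllegalArgumentException("unknown book id " + bookId);
        }
        bookWords.add(new BookWord(bookWords.size() + 1, bookId, word));
        translations.put(word, translation);
    }

    // Joins the word rows of one book with the translation table.
    Map<String, String> cardsFor(int bookId) {
        Map<String, String> cards = new LinkedHashMap<>();
        for (BookWord row : bookWords) {
            if (row.bookId == bookId) {
                cards.put(row.word, translations.get(row.word));
            }
        }
        return cards;
    }
}
```

In the real application the same join would be expressed as a query against the SQLite database; the sketch only makes the key relationships explicit.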
There are two main Activities built for the project. One was the PageView Activity for
wrapping the word-related views. The other was the camera view Activity called
CaptureActivity, which was built in the tess-two project. To transfer the captured
string data from one Activity to another, an Intent was adopted.
An Intent is an object that provides runtime binding between separate components [14].
By calling the putExtra() function, the result string is attached to the Intent. Then
I called the setResult() function to pass the Intent back to the caller Activity,
which hosts the ViewPager. On the caller's side, the onActivityResult() function is
called back when the camera view Activity that it launched exits. The Intent is
delivered as the third parameter of the callback function. Finally I called the
getStringExtra() function to get the data bound to the Intent.
Once the OCR data is transferred to the ViewPager successfully, the workflow of the
flash card will be initialized, as mentioned in section 4.4.1.
5 Project Evaluation
The quality of the OCR was primarily evaluated by manual testing. The test devices
included an LG G2 and a Samsung Nexus 4 Android phone. When the Tesseract OCR engine
was applied to character input through the mobile camera, a few drawbacks emerged
under different test conditions.
The ideal OCR target is a word list printed on a flat piece of paper in a font that
the engine is familiar with, that is, one that has been trained or stored. When a book
is opened, the part of the page close to the middle gutter becomes curved. The curved
paper leads to distorted words, which lowers the accuracy.
Because of the limited size of the mobile screen, the rectangular selection area may
not enclose all the expected words in a whole line or a whole bar. The target
paragraph may have a polygonal shape other than a rectangle. Therefore, unrelated
information may also be captured. Apart from these two most notable defects, the
recognition output is generated quickly and is accurate.
In this case, another way of recognizing words in OCR can be reasonably designed.
First, build an activity similar to a drawing application with transparent background. Set
the drawing track thick enough to wrap the words. The drawing track is regarded as the
selection area instead of the rectangle. Cut the image of the selection from the back-
ground. Then deploy the rest of the work to the Tesseract engine. Android Canvas
could be adopted to implement the drawing function.
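The proposed cut-out step can be sketched independently of the Android drawing code. The following is only an illustration of the idea under stated assumptions: the image is modeled as a grayscale array indexed as [y][x], the drawing track as a list of {x, y} points with a brush radius, and all names are hypothetical. Pixels outside the track are painted white so that only the selected words remain for the OCR engine.

```java
// Hypothetical sketch of cutting a freehand selection out of a grayscale image.
class StrokeSelection {
    static final int WHITE = 255;

    // Returns a copy of the image in which every pixel farther than radius
    // from all stroke points is replaced by white.
    static int[][] cut(int[][] image, int[][] strokePoints, int radius) {
        int height = image.length;
        int width = (height == 0) ? 0 : image[0].length;
        int[][] result = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                result[y][x] = insideStroke(x, y, strokePoints, radius) ? image[y][x] : WHITE;
            }
        }
        return result;
    }

    // True if (x, y) lies within radius of at least one stroke point {px, py}.
    static boolean insideStroke(int x, int y, int[][] strokePoints, int radius) {
        for (int[] point : strokePoints) {
            int dx = x - point[0];
            int dy = y - point[1];
            if (dx * dx + dy * dy <= radius * radius) {
                return true;
            }
        }
        return false;
    }
}
```

On a device, the stroke points would come from touch events and the pixel data from the captured bitmap; checking every pixel against every stroke point is simple but slow, so a real implementation would more likely render the track into a mask once.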
Since computer vision tasks such as optical character recognition are still quite
heavy for today's mobile processors, running OCR as a cloud service on a remote server
could be another direction for further development. In this case, the image would not
be processed locally: users would need a network connection to upload the captured
image, and only the string recognized from the image would be returned.
6 Conclusions
With decades of development and the emergence of new conferences, the OCR technique
has made remarkable progress. Moreover, computers are much faster than before. The way
computers learn and acquire information is gravitating towards simulating human
beings. Optical character recognition is a technology that not only lowers the cost of
manual labor with a faster computer input solution, but also provides users with a
better experience.
In this project, OCR technology was applied to build a flash card Android application for
memorizing new words. Users should be able to recognize and record the words with-
out Internet access. The project was started in January 2014, and was finished at the
beginning of March 2014. After two months of effort, the application was successfully
built and it ran as expected.
The goal of the project was accomplished, since the primary stages of the flash card
workflow were implemented and the application gives users a better experience of
selecting new words. Considering that the application was developed within two months,
I am satisfied with its current form and with the project achievement.
References
5. Cabebe Jamar. Google Translate for Android adds OCR [online]; 2014.
URL: http://www.cnet.com/news/google-translate-for-android-adds-ocr/ Accessed
10 April 2014
10. Mohul G. Android Has 82% Market Share [online]; November 2013.
URL: http://trak.in/tags/business/2013/11/15/smartphone-market-share-3q2013-lenovo-3rdandroid/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+trakin+%28India+Business+Blog+%21%29
Accessed 10 April 2014
11. Google Inc. (ed.): Android Software Development Kit (SDK) [online]; 2014.
URL: http://developer.android.com/sdk/index.html. Accessed 10 April 2014