
How to OCR Hindi text

using VietOCR as the GUI frontend for the Tesseract OCR 3.02 engine
Tesseract is probably the most accurate open source OCR engine available. It was developed at HP
Labs between 1985 and 1995 and is now developed at Google. Version 3.01 added support for Hindi,
and version 3.02 improved Hindi recognition further. [T]
VietOCR, available as Java and .NET executables, is a GUI frontend for the Tesseract OCR engine. [V]
Both versions have a similar graphical user interface and can recognize text from images
in common formats. The program can also run as a console application from the
command line.
Its features include:

Java & .NET GUI frontends for Tesseract OCR engine


Supports all languages provided by Tesseract
Supports automatic download and installation of language packs
PDF, TIFF, JPEG, GIF, PNG, BMP image formats
Paste image from clipboard
Selection box for Region of Interest (ROI)
File drag-and-drop
Bulk & batch operations
Text replacement postprocessing
Integrated scanning support
Spellcheck with Hunspell

The VietOCR usage page [VU] gives detailed information about download, installation, and usage.
Please read that page fully to get an overview of the features and functionality.
To OCR images with Hindi text using VietOCR, follow the instructions below:

1. Download VietOCR
VietOCR is available in two versions, .NET and Java; download the one of your choice. The
Java version requires Java Runtime Environment 6.0 or later (installation instructions). The .NET
version requires the Microsoft .NET Framework 2.0 Redistributable.

The .NET version is available at


http://sourceforge.net/projects/vietocr/files/vietocr.net/

The Java version is available at


http://sourceforge.net/projects/vietocr/files/vietocr/

Choose the latest version. Currently these are:

http://sourceforge.net/projects/vietocr/files/vietocr/3.4.3%20Beta/VietOCR-3.4.3Beta4.zip/download

http://sourceforge.net/projects/vietocr/files/vietocr.net/3.4.1%20Beta/VietOCR.NET-3.4.1Beta4.zip/download

[T] http://code.google.com/p/tesseract-ocr/
[V] http://sourceforge.net/projects/vietocr/
[VU] http://vietocr.sourceforge.net/usage.html

2. Tesseract 3.02 Traineddata for Hindi


Language data for Vietnamese and English is already bundled with the program. Data for other
languages can be downloaded from the Tesseract website and should be placed in the tessdata
folder. VietOCR can also download and install language data packs itself.

Official Hindi Traineddata is available from Tesseract's download page at


http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr3.02.hin.tar.gz&can=2&q=
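
If you install the language data by hand, the step of unpacking the archive and placing the data
in the tessdata folder can be sketched in Python as below. This is only an illustration: the
function name install_traineddata and the example paths are assumptions, not part of VietOCR or
Tesseract, and the location of the tessdata folder depends on where you installed VietOCR.

```python
import shutil
import tarfile
from pathlib import Path


def install_traineddata(archive: Path, tessdata_dir: Path) -> list[Path]:
    """Copy every *.traineddata file found in a .tar.gz archive into tessdata_dir.

    Directory prefixes inside the archive are dropped, so each file lands
    directly in tessdata_dir. Returns the paths of the files that were written.
    """
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name
            if member.isfile() and name.endswith(".traineddata"):
                target = tessdata_dir / name
                source = tar.extractfile(member)
                with source, open(target, "wb") as out:
                    shutil.copyfileobj(source, out)
                installed.append(target)
    return installed


# Example call (both paths are assumptions, adjust them to your system):
# install_traineddata(
#     Path(r"C:\Downloads\tesseract-ocr3.02.hin.tar.gz"),
#     Path(r"C:\Program Files (x86)\VietOCR.NET\tessdata"),
# )
```

Writing into a folder under Program Files may require administrator rights.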

3. Hindi Dictionary Files


Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic)
should be placed in the dict folder of VietOCR.
user.dic is a UTF-8-encoded file that contains a list of custom words, one word per line.
For Hindi, the dictionary files are named hi_IN.dic and hi_IN.aff.
A larger Hindi dictionary is linked from http://raviratlami.blogspot.in/2012/10/blog-post.html. It
can be downloaded from:

http://goo.gl/IMspZ

https://skydrive.live.com/#cid=60EACE63E15A752A&id=60EACE63E15A752A%21113

If you have any custom dictionaries defined, you can use those wordlists in user.dic.
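
As an illustration of the user.dic format described above (UTF-8, one word per line), a custom
word list could be written with a short Python sketch like this one. The function name
write_user_dic and the sample words are made up for the example; only the file name and the
encoding come from the description above.

```python
from pathlib import Path


def write_user_dic(words, dict_dir: Path) -> Path:
    """Write unique, non-empty words to dict_dir/user.dic, one per line, as UTF-8."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    target = Path(dict_dir) / "user.dic"
    with open(target, "w", encoding="utf-8", newline="\n") as handle:
        for word in unique:
            handle.write(word + "\n")
    return target


# Example call (the folder is an assumption, use the dict folder of your install):
# write_user_dic(["भारत", "हिन्दी"], Path(r"C:\Program Files (x86)\VietOCR.NET\dict"))
```

Note that this overwrites an existing user.dic, so merge any words you already have into the
list first.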

4. Install VietOCR.net

Unzip VietOCR.NET-3.4.1-Beta3.zip (or newer version) from your downloads

Click on Setup.exe to install the software

Select Installation Directory e.g. C:\Program Files (x86)\VietOCR.NET

Complete the installation

7. Run VietOCR.NET

Click on VIETOCR.exe in VIETOCR.NET directory


or
Click on VIETOCR.NET under All Programs to start the program

You should see the following window come up

Test to verify that the program is working by OCRing an image with English text.

You can open an image file or copy and paste an image in the program.

Check that the OCR Language in the dropdown menu on the right says 'English'

Click on the OCR button to start OCR

The status bar at the bottom left will show "OCR Running" and change to "OCR completed" when
done.

Clicking Eraser icon will erase the OCRed text

Clicking on ABC icon will run spellcheck on the OCRed text

Right Click on OCRed text area to bring up menu to Select All, Cut, Copy etc.

Click on the various menus and icons to familiarize yourself with the options

5. Copy Hindi Language Data and Dictionary

8. OCR Hindi text part of an image

Choose Hindi as the OCR Language in the dropdown menu

Open an image with Hindi text using file open or copy and paste

Select a portion of the image using the mouse

Click on OCR button

You can rotate, zoom, or fit the image using the icons on the menu bar on the left

9. OCR a multipage TIFF

Open the multipage TIFF file

The status bar will show that the image is loading; this may take some time depending on the size of the file

Menu Bar on left has arrows for page navigation

The Command menu has options to OCR the current page or to OCR all pages in the TIFF

Click on OCR page to OCR the current page

Wait for OCR to complete; large files will take time.

10. OCR a PDF

VietOCR supports PDF files using Ghostscript.

It will create working images from the PDF and then do the OCR

It does not allow choosing a page range when loading a file, so if you need only a few
pages from a large PDF, make another PDF with just those pages to speed up processing.

Open the pdf file in VietOCR.NET

Allow for file loading to complete

Page navigation arrows can be used as with a multipage TIFF

Status on top of image shows Page # of ##

OCR one page or all pages

Save output.
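
One way to make a smaller PDF containing only the pages you need is Ghostscript itself, which
VietOCR already relies on for PDF support. The Python sketch below only builds and runs a
Ghostscript command; the function names are made up for this example, and the executable name
is an assumption (typically gs on Linux, gswin32c or gswin64c on Windows).

```python
import subprocess


def gs_extract_pages_cmd(src, dst, first, last, gs_exe="gs"):
    """Build a Ghostscript command that copies pages first..last of src into dst."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",      # write the result as a PDF
        "-dNOPAUSE", "-dBATCH",   # run without prompts and exit when done
        f"-dFirstPage={first}",
        f"-dLastPage={last}",
        f"-sOutputFile={dst}",
        str(src),
    ]


def extract_pages(src, dst, first, last, gs_exe="gs"):
    """Run Ghostscript; raises CalledProcessError if it reports a failure."""
    subprocess.run(gs_extract_pages_cmd(src, dst, first, last, gs_exe), check=True)


# Example call (file names are placeholders):
# extract_pages("book.pdf", "pages-10-to-15.pdf", 10, 15, gs_exe="gswin32c")
```

The resulting file can then be opened in VietOCR.NET in place of the full PDF.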

11. Bulk OCR

Bulk OCR option can be used to OCR a large number of files in a batch mode

Put the images to be OCRed in a separate folder

Create a new folder for the OCRed text

Choose Bulk OCR from Commands menu

Choose the BulkOCR and Output directory in the dialog box

Check HOCR option if you want the output as HTML pages

Leave option unchecked for text output

Click on RUN to start the Bulk OCR process

A console window will come up showing the progress of the batch.

You can check on the output in the files generated in the OUTPUT folder

Use command Cancel Bulk OCR to cancel batch.
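
For comparison, a similar batch run can be scripted outside the GUI by calling the Tesseract
command-line program directly on each image in a folder. The sketch below is not how VietOCR
implements Bulk OCR; it assumes that a tesseract executable with the Hindi data is on the PATH,
and the function names and the list of image extensions are choices made for this example.

```python
import subprocess
from pathlib import Path

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def plan_bulk_ocr(input_dir, output_dir, lang="hin", hocr=False, exe="tesseract"):
    """Return one 'tesseract <image> <outputbase> -l <lang> [hocr]' command per image."""
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            output_base = Path(output_dir) / image.stem  # tesseract adds the extension
            cmd = [exe, str(image), str(output_base), "-l", lang]
            if hocr:
                cmd.append("hocr")  # config name that requests hOCR (HTML) output
            commands.append(cmd)
    return commands


def run_bulk_ocr(input_dir, output_dir, **options):
    """Create the output folder and run the planned commands one after another."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in plan_bulk_ocr(input_dir, output_dir, **options):
        print("OCR:", cmd[1])
        subprocess.run(cmd, check=True)


# Example call (folder names are placeholders):
# run_bulk_ocr(r"C:\scans\hindi", r"C:\scans\hindi-text", hocr=False)
```

As in the dialog box, the images and the OCRed text are kept in separate folders.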

12. Post Processing

If you notice any consistent errors in the OCRed output, you can set up a substitution table to
correct them using a DangAmbigs.txt file and postprocess the text in VietOCR.

For this to work, you have to create a text file called hin.DangAmbigs.txt in the
data directory and add the required substitutions to it.

You also have to enable the postprocess option under settings.

For example, if you notice that is being OCRed as , you should add the following
entry to the hin.DangAmbigs.txt file and save the file.


=
Added space after to ensure that it is being changed only when it is at the end of the
word.

Settings - Options - DangAmbigs.txt

Browse to the Data subfolder in VIETOCR.NET and choose hin.DangAmbigs.txt

Check Enable

Now after you OCR a page, use Command - Postprocess and the substitution will be
applied to the OCRed text.

Add more substitutions as required.
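
The idea behind such a substitution table can be sketched in a few lines of Python. This is an
illustration only and not VietOCR's own postprocessing code: it assumes one wrong=right entry per
line and plain text replacement, and the Latin-letter entry in the usage comment is a made-up
stand-in for a real Hindi substitution.

```python
def load_substitutions(table_text):
    """Parse 'wrong=right' lines into (wrong, right) pairs.

    Spaces are kept as written, so an entry can include a trailing space
    to match only at the end of a word.
    """
    pairs = []
    for line in table_text.splitlines():
        if not line.strip() or "=" not in line:
            continue  # skip blank lines and lines without a separator
        wrong, right = line.split("=", 1)
        if wrong:
            pairs.append((wrong, right))
    return pairs


def postprocess(ocr_text, pairs):
    """Apply every substitution, in table order, to the OCRed text."""
    for wrong, right in pairs:
        ocr_text = ocr_text.replace(wrong, right)
    return ocr_text


# Usage with a hypothetical entry that fixes "cl" only before a space:
# pairs = load_substitutions("cl =d \n")
# postprocess("olcl clear", pairs)
```

Because entries are applied in order, a later entry also sees the changes made by earlier ones,
so keep the table short and test it on a sample page.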
