OCR Hindi Using VietOCR and Tesseract

How to OCR Hindi text
using VietOCR as the GUI frontend for the Tesseract OCR 3.02 engine
Tesseract is probably the most accurate open source OCR engine available. It was developed at HP
Labs between 1985 and 1995 and is now developed at Google. Version 3.01 added support for Hindi,
and version 3.02 further improved Hindi recognition. [T]
VietOCR, available as Java and .NET executables, is a GUI frontend for the Tesseract OCR engine. [V]
Both versions sport a similar graphical user interface and can recognize text from images
in common formats. The program can also function as a console application, executed from the
command line.
Its features include:
Java & .NET GUI frontends for Tesseract OCR engine

Supports all languages provided by Tesseract
Supports automatic download and installation of language packs
PDF, TIFF, JPEG, GIF, PNG, BMP image formats
Paste image from clipboard
Selection box for Region of Interest (ROI)
File drag-and-drop
Bulk & batch operations
Text replacement postprocessing
Integrated scanning support
Spellcheck with Hunspell
The VietOCR Usage page [VU] gives detailed information about download, installation and usage.
Please read that page fully to get an overview of the features and functionality.
To use VietOCR to OCR images with Hindi text, follow these instructions:
1. Download VietOCR
VietOCR is available in two versions, .NET and Java. Download the one of your choice. The
Java version requires Java Runtime Environment 6.0 or later (see its installation instructions).
The .NET version requires the Microsoft .NET Framework 2.0 Redistributable.
The .NET version is available at

http://sourceforge.net/projects/vietocr/files/vietocr.net/
The Java version is available at

http://sourceforge.net/projects/vietocr/files/vietocr/
Choose the latest version. At the time of writing, these are:
http://sourceforge.net/projects/vietocr/files/vietocr/3.4.3%20Beta/VietOCR-3.4.3Beta4.zip/download
http://sourceforge.net/projects/vietocr/files/vietocr.net/3.4.1%20Beta/VietOCR.NET-3.4.1Beta4.zip/download
[T] http://code.google.com/p/tesseract-ocr/
[V] http://sourceforge.net/projects/vietocr/
[VU] http://vietocr.sourceforge.net/usage.html
2. Tesseract 3.02 Traineddata for Hindi

Language data for Vietnamese and English is already bundled with the program. Data for other
languages can be downloaded from the Tesseract website and should be placed into the tessdata
folder. VietOCR can also download and install language data packs itself.
The official Hindi traineddata is available from Tesseract's download page at

http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr3.02.hin.tar.gz&can=2&q=
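If you prefer to script the manual installation, the following is a minimal Python sketch of unpacking the downloaded archive into the tessdata folder. It assumes that the Hindi files inside the archive are all named with a "hin." prefix (the text above does not list the archive contents), and the function name install_language_data and the example paths are made up for illustration.

```python
import tarfile
from pathlib import Path


def install_language_data(archive, tessdata_dir, lang="hin"):
    """Copy every regular file named '<lang>.*' from a .tar.gz archive
    into tessdata_dir, ignoring the folder layout inside the archive.

    Assumption: the language files share the '<lang>.' name prefix.
    Returns the sorted list of files that were written.
    """
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name
            if member.isfile() and name.startswith(lang + "."):
                target = tessdata_dir / name
                # extractfile() returns a readable file object for regular files
                target.write_bytes(tar.extractfile(member).read())
                installed.append(target)
    return sorted(installed)


# Example call (both paths are placeholders, adjust them to your system):
# install_language_data(r"C:\Downloads\hindi-data.tar.gz",
#                       r"C:\Program Files (x86)\VietOCR.NET\tessdata")
```

Flattening the archive layout is a deliberate choice here: only the file names matter once the files sit in the tessdata folder.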
3. Hindi Dictionary Files

Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic)
should be placed in the dict folder of VietOCR.
user.dic is a UTF-8-encoded file which contains a list of custom words, one word per line.
For Hindi, the dictionary files are named hi_IN.dic and hi_IN.aff.
A larger Hindi dictionary is linked from http://raviratlami.blogspot.in/2012/10/blog-post.html. It
can be downloaded from:
http://goo.gl/IMspZ
https://skydrive.live.com/#cid=60EACE63E15A752A&id=60EACE63E15A752A%21113
If you have defined any custom dictionaries, you can add their wordlists to user.dic.
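As an illustration of the user.dic format described above (UTF-8, one word per line), here is a small Python sketch that writes such a file from a list of custom words. The function name write_user_dic is hypothetical, and dropping blank entries and duplicates is a convenience of this sketch, not something the format is stated to require.

```python
from pathlib import Path


def write_user_dic(words, user_dic_path):
    """Write custom words to a UTF-8 file, one word per line.

    Blank entries and repeated words are skipped (a choice made for this
    sketch). Returns the number of words written.
    """
    unique = []
    seen = set()
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    Path(user_dic_path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


# Example call (the path is a placeholder for the dict folder of VietOCR):
# write_user_dic(["नमस्ते", "भारत"], r"C:\Program Files (x86)\VietOCR.NET\dict\user.dic")
```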
4. Install VietOCR.net
Unzip VietOCR.NET-3.4.1-Beta3.zip (or newer version) from your downloads
Click on Setup.exe to install the software
Select Installation Directory e.g. C:\Program Files (x86)\VietOCR.NET
Complete the installation
5. Copy Hindi Language Data and Dictionary
Copy the Hindi traineddata into the tessdata folder of VietOCR.NET, and the hi_IN.dic and
hi_IN.aff dictionary files into its dict folder.
7. Run VietOCR.NET
Click on VIETOCR.exe in the VIETOCR.NET directory

or
click on VIETOCR.NET under All Programs to start the program.
The main program window should come up.
Verify that the program is working by OCRing an image with English text.
You can open an image file, or copy and paste an image into the program.
Check that the OCR Language dropdown menu on the right says 'English'
Click on the OCR button to start OCR
The status bar at the bottom left will show "OCR Running" and change to "OCR completed" when
done.
Clicking the Eraser icon will erase the OCRed text
Clicking the ABC icon will run spellcheck on the OCRed text
Right-click on the OCRed text area to bring up a menu with Select All, Cut, Copy etc.
Click on the various menus and icons to familiarize yourself with the options
8. OCR the Hindi text in part of an image
Choose Hindi as the OCR Language in the dropdown menu
Open an image with Hindi text using File > Open, or copy and paste it
Select a portion of the image using the mouse
Click on the OCR button
You can rotate, zoom, or fit the image using the icons on the menu bar on the left
9. OCR a multipage TIFF
Open the multipage TIFF file
The status bar will show that the image is loading. This may take some time depending on the size of the file.
The menu bar on the left has arrows for page navigation
The Command menu has options to OCR the current page or to OCR all pages in the TIFF
Click on OCR page to OCR the current page
Wait for the OCR to complete. Large files will take time.
10. OCR a PDF
VietOCR supports PDF files using Ghostscript.
It creates working images from the PDF and then runs the OCR on them.
It does not allow choosing a page range when loading a file, so if you need only a few
pages from a large PDF, make another PDF with just those pages to speed up processing.
Open the PDF file in VietOCR.NET
Allow the file loading to complete
The page navigation arrows can be used as with a multipage TIFF
The status above the image shows Page # of ##
OCR one page or all pages
Save the output.
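One way to make the smaller PDF mentioned above is to call Ghostscript yourself. The Python sketch below only builds and runs the command line. It relies on Ghostscript's pdfwrite device and its -dFirstPage / -dLastPage switches; the function names are made up, and the executable name "gs" is an assumption (on Windows the console executable is typically gswin32c or gswin64c).

```python
import subprocess


def gs_page_range_command(src_pdf, dst_pdf, first_page, last_page, gs_exe="gs"):
    """Build a Ghostscript command that writes pages first_page..last_page
    (1-based, inclusive) of src_pdf to dst_pdf."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("need 1 <= first_page <= last_page")
    return [
        gs_exe,
        "-dBATCH",                 # exit after processing the file
        "-dNOPAUSE",               # do not wait for a key press between pages
        "-sDEVICE=pdfwrite",       # output device that writes a PDF
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={dst_pdf}",
        str(src_pdf),
    ]


def extract_pages(src_pdf, dst_pdf, first_page, last_page, gs_exe="gs"):
    """Run the command; raises CalledProcessError if Ghostscript fails."""
    cmd = gs_page_range_command(src_pdf, dst_pdf, first_page, last_page, gs_exe)
    subprocess.run(cmd, check=True)


# Example call (file names are placeholders):
# extract_pages("book.pdf", "book_p10-15.pdf", 10, 15, gs_exe="gswin64c")
```

The resulting file can then be opened in VietOCR.NET like any other PDF.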
11. Bulk OCR
The Bulk OCR option can be used to OCR a large number of files in batch mode
Put the images to be OCRed in a separate folder
Create a new folder for the OCRed text
Choose Bulk OCR from the Command menu
Choose the Bulk OCR input directory and the output directory in the dialog box
Check the HOCR option if you want the output as HTML pages
Leave the option unchecked for text output
Click on RUN to start the Bulk OCR process
A console window will come up showing the progress of the batch.
You can check the results in the files generated in the output folder
Use the Cancel Bulk OCR command to cancel the batch.
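For comparison, a similar batch can be run without the GUI by calling the Tesseract command line directly. The Python sketch below is not how VietOCR implements Bulk OCR; it only mirrors the steps above (input folder, output folder, optional hOCR output) using the standard invocation "tesseract image outputbase -l lang [hocr]". The function names, the list of file extensions, and the executable name are assumptions, and which image formats actually work depends on how Tesseract was built.

```python
import subprocess
from pathlib import Path

# Assumed set of image types; adjust to what your Tesseract build can read.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def build_bulk_commands(input_dir, output_dir, lang="hin", hocr=False,
                        tesseract_exe="tesseract"):
    """Return one Tesseract command per image file in input_dir.

    Each output is written to output_dir under the image's base name;
    Tesseract itself appends the file extension.
    """
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            output_base = Path(output_dir) / image.stem
            cmd = [tesseract_exe, str(image), str(output_base), "-l", lang]
            if hocr:
                cmd.append("hocr")  # config name that switches to hOCR output
            commands.append(cmd)
    return commands


def run_bulk(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example call (folder names are placeholders):
# run_bulk(build_bulk_commands("scans", "ocr_text", lang="hin"))
```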
12. Post Processing
If you notice any consistent errors in the OCRed output, you can set up a substitution table in a
DangAmbigs.txt file and let VietOCR correct them in a postprocessing step.
For this to work, you have to create a text file called hin.DangAmbigs.txt in the
data directory and add the required substitutions to it.
You also have to enable the postprocess option under settings.
For example, if you notice that one character sequence is consistently being OCRed as another,
add an entry of the following form to the hin.DangAmbigs.txt file and save the file.

(text as OCRed)=(correct text)
Adding a space after the sequence ensures that it is changed only when it is at the end of a
word.
Go to Settings > Options > DangAmbigs.txt
Browse to the Data subfolder in VIETOCR.NET and choose hin.DangAmbigs.txt
Check Enable
Now, after you OCR a page, use Command > Postprocess and the substitutions will be
applied to the OCRed text.
Add more substitutions as required.
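To make the effect of such a substitution table concrete, here is a minimal Python sketch of how entries could be applied to OCRed text. It is not VietOCR's own postprocessing code. It assumes each entry is a plain line with the OCRed text to the left of the first "=" and the correction to the right, applied as literal replacements in file order; the function names and the Latin stand-in strings are made up for illustration.

```python
def parse_substitutions(table_text):
    """Parse 'wrong=right' lines into a list of (wrong, right) pairs.

    Lines are not stripped, so a trailing space in an entry is kept. That is
    what lets an entry match only at the end of a word.
    """
    rules = []
    for line in table_text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        wrong, right = line.split("=", 1)
        if wrong:
            rules.append((wrong, right))
    return rules


def apply_substitutions(ocr_text, rules):
    """Apply every rule, in order, as a literal search and replace."""
    for wrong, right in rules:
        ocr_text = ocr_text.replace(wrong, right)
    return ocr_text


# Latin stand-ins for Hindi sequences: "rn " (with trailing space) is only
# replaced at the end of a word, so "rnd" is left alone.
rules = parse_substitutions("rn =m \nteh=the\n")
print(apply_substitutions("teh burn rnd", rules))  # prints "the bum rnd"
```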
