Using VietOCR as the GUI frontend for the Tesseract OCR 3.02 engine
Tesseract is probably the most accurate open source OCR engine available. It was developed at HP
Labs between 1985 and 1995 and is now developed at Google. Version 3.01 added support for Hindi,
and version 3.02 further improved Hindi recognition. [T]
VietOCR, available as Java and .NET executables, is a GUI frontend for the Tesseract OCR engine. [V]
Both versions have a similar graphical user interface and can recognize text from images in
common formats. The program can also run as a console application, executed from the
command line.
Its features include:
The VietOCR usage page [VU] gives detailed information about download, installation, and usage.
Please read it in full to get an overview of the features and functionality.
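The command-line use mentioned above can be scripted. VietOCR's own console arguments are not listed here, so the following is only a minimal sketch that calls the Tesseract 3.x executable directly, assuming `tesseract` is on the PATH and the Hindi language data (`hin`) is installed. The function names `build_tesseract_command` and `run_ocr` are hypothetical helpers, not part of VietOCR or Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="hin"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="hin"):
    """Run Tesseract on one image and return the path of the text file it writes.

    Assumes the tesseract executable is installed; raises if it is not found.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

For example, `run_ocr("page.tif", "page")` would be expected to produce `page.txt` next to the script, provided the assumptions above hold.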
To use VietOCR to OCR images containing Hindi text, follow these instructions:
1. Download VietOCR
VietOCR is available in two versions, .NET and Java; download the one of your choice. The
Java version requires Java Runtime Environment 6.0 or later (installation instructions). The .NET
version requires the Microsoft .NET Framework 2.0 Redistributable.
http://sourceforge.net/projects/vietocr/files/vietocr/3.4.3%20Beta/VietOCR-3.4.3Beta4.zip/download
http://sourceforge.net/projects/vietocr/files/vietocr.net/3.4.1%20Beta/VietOCR.NET-3.4.1Beta4.zip/download
[T] http://code.google.com/p/tesseract-ocr/
[V] http://sourceforge.net/projects/vietocr/
[VU] http://vietocr.sourceforge.net/usage.html
http://goo.gl/IMspZ
https://skydrive.live.com/#cid=60EACE63E15A752A&id=60EACE63E15A752A%21113
If you have any custom dictionaries defined, you can use those wordlists in user.dic.
4. Install VietOCR.NET
7. Run VietOCR.NET
Verify that the program is working by OCRing an image with English text.
You can open an image file or copy and paste an image into the program.
Check that the OCR Language dropdown menu on the right says 'English'.
The status bar at the bottom left will show OCR Running and change to OCR completed when
done.
Right-click on the OCRed text area to bring up a menu with Select All, Cut, Copy, etc.
Click on the various menus and icons to familiarize yourself with the options.
Open an image with Hindi text using File Open or copy and paste.
You can rotate, zoom, and fit the image using the icons on the menu bar on the left.
The status bar will show that the image is loading; this may take some time depending on the size of the file.
The Command menu has options to OCR the current page or to OCR all pages in a TIFF.
For a PDF, the program creates working images from the PDF and then runs the OCR.
The program does not allow choosing a page range to load, so if you need only a few
pages from a large PDF, make another PDF with just those pages to speed up processing.
Save the output.
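Making a smaller PDF with only the pages you need can be done with any PDF tool. As one possible approach, the sketch below uses the third-party `pypdf` library, which is an assumption and is not mentioned in the original instructions. The helper names `parse_page_range` and `extract_pages` and the file names are hypothetical.

```python
def parse_page_range(spec):
    """Turn a 1-based spec such as "2-4,7" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError("invalid page range: " + part)
        indices.extend(range(first - 1, last))
    return indices


def extract_pages(source_pdf, target_pdf, spec):
    """Copy the selected pages of source_pdf into a new, smaller target_pdf."""
    # Imported here so the range parser works even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source_pdf)
    writer = PdfWriter()
    for index in parse_page_range(spec):
        writer.add_page(reader.pages[index])
    with open(target_pdf, "wb") as handle:
        writer.write(handle)
```

Calling `extract_pages("book.pdf", "excerpt.pdf", "2-4,7")` would write a four-page PDF that can then be opened in VietOCR in place of the full file.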
The Bulk OCR option can be used to OCR a large number of files in batch mode.
You can check the output in the files generated in the OUTPUT folder.
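To illustrate what a batch run does conceptually, the sketch below pairs each image in an input folder with a text file of the same name in an output folder. It does not reproduce VietOCR's Bulk OCR feature; the function name `plan_batch`, the folder names, and the list of image extensions are assumptions made for the example.

```python
from pathlib import Path

# Assumed set of image types; adjust to the formats you actually use.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def plan_batch(input_dir, output_dir):
    """Return (image_path, text_path) pairs for every image in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for path in sorted(input_dir.iterdir()):
        # Skip subfolders and files that are not recognized image types.
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            jobs.append((path, output_dir / (path.stem + ".txt")))
    return jobs
```

Each pair tells you which output file to inspect for a given input image. Note that two images sharing a base name (for example `scan.tif` and `scan.png`) would map to the same text file in this simple scheme.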
If you notice any consistent errors in the OCRed output, you can set up a substitution table
in a DangAmbigs.txt file to correct them and postprocess the text using VietOCR.
For this to work, you have to create a text file called hin.DangAmbigs.txt in the
data directory and add the required substitutions to it.
For example, if you notice that is being OCRed as , you should add the following:
Check Enable
Now, after you OCR a page, use Command - Postprocess and the substitution will be
applied to the OCRed text.
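The idea behind a substitution table can be sketched in a few lines. The example below assumes one `wrong=right` pair per line, which is an assumption about the file layout and not confirmed by these instructions. It uses Latin placeholder strings because the original Hindi example characters are not available. The function names are hypothetical, and this is not VietOCR's own postprocessing code.

```python
def parse_substitutions(text):
    """Parse assumed 'wrong=right' lines, skipping blank lines and '#' comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        wrong, right = line.split("=", 1)
        pairs.append((wrong, right))
    return pairs


def apply_substitutions(ocr_text, pairs):
    """Apply each substitution in order to the OCRed text."""
    for wrong, right in pairs:
        ocr_text = ocr_text.replace(wrong, right)
    return ocr_text


# Placeholder table; a real hin.DangAmbigs.txt would hold Devanagari strings
# and should be read as UTF-8.
table = "rn=m\nvv=w\n"
print(apply_substitutions("cornputer vvord", parse_substitutions(table)))
# prints "computer word"
```

Because the substitutions are applied in order, a later entry can act on text produced by an earlier one, so keep the table short and check the result on a sample page.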