Using VietOCR as the GUI frontend for the Tesseract OCR 3.02 engine
“Tesseract is probably the most accurate open source OCR engine available. It was developed at HP
Labs between 1985 and 1995... and now at Google.” Version 3.01 added support for Hindi, and
version 3.02 further improved Hindi recognition. [T]
“VietOCR, available in Java and .NET executable, is a GUI frontend for Tesseract OCR engine. [V]
Both versions sport similar graphic user interface and are capable of recognizing text from images
of common formats. The program can also function as a console application, executing from the
command line.
Its features include:
• Java & .NET GUI frontends for Tesseract OCR engine
• Supports all languages provided by Tesseract
• Supports automatic download and installation of language packs
• PDF, TIFF, JPEG, GIF, PNG, BMP image formats
• Paste image from clipboard
• Selection box for Region of Interest (ROI)
• File drag-and-drop
• Bulk & batch operations
• Text replacement postprocessing
• Integrated scanning support
• Spellcheck with Hunspell”
The VietOCR Usage Page [VU] gives detailed information about download, installation, and usage.
Please read this page in full to get an overview of the features and functionality.
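The console mode mentioned above can be scripted. The following is a minimal sketch in Python that only builds the argument list for one console run of the Java version. The argument order (image file, output file, then -l and a language code) and the jar name VietOCR.jar are assumptions modeled on the usage page, and vietocr_command is a hypothetical helper name; confirm the syntax on the usage page for your version before relying on it.

```python
from typing import List


def vietocr_command(image: str, output_base: str, lang: str = "hin",
                    jar: str = "VietOCR.jar") -> List[str]:
    """Build the argument list for one console-mode VietOCR run.

    Assumed syntax (check the VietOCR usage page for your version):
        java -jar VietOCR.jar imagefile outputfile -l lang
    """
    return ["java", "-jar", jar, image, output_base, "-l", lang]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(vietocr_command("page1.tif", "page1")))
```

Passing the list to subprocess.run(cmd, check=True) from the VietOCR installation directory would start the actual run, provided the assumed syntax matches your version.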
To use VietOCR to OCR images with Hindi text, follow these instructions:
1. Download VietOCR
VietOCR is available in two versions, .NET and Java; download the one of your choice. The
Java version requires Java Runtime Environment 6.0 or later (installation instructions). The .NET
version requires Microsoft .NET Framework 2.0 Redistributable.
• The .NET version is available at
http://sourceforge.net/projects/vietocr/files/vietocr.net/
• The Java version is available at
http://sourceforge.net/projects/vietocr/files/vietocr/
Choose the latest version. At the time of writing these are:
• http://sourceforge.net/projects/vietocr/files/vietocr/3.4.3%20Beta/VietOCR-3.4.3-Beta4.zip/download
• http://sourceforge.net/projects/vietocr/files/vietocr.net/3.4.1%20Beta/VietOCR.NET-3.4.1-Beta4.zip/download
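Before choosing the Java download, it can help to confirm that the installed Java meets the "6.0 or later" requirement. The sketch below, in Python, parses the text printed by java -version. The function names are hypothetical; the one fact it relies on is that releases before Java 9 report themselves as 1.x (so 1.6 means Java 6).

```python
import re
from typing import Optional


def java_feature_version(version_output: str) -> Optional[int]:
    """Extract the Java feature version (6, 8, 17, ...) from `java -version` text."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    # Releases before Java 9 use the 1.x scheme, so "1.6.0_45" is Java 6.
    return minor if major == 1 else major


def meets_vietocr_requirement(version_output: str, minimum: int = 6) -> bool:
    """Return True if the reported Java version is at least `minimum`."""
    version = java_feature_version(version_output)
    return version is not None and version >= minimum
```

Note that java -version writes its report to standard error, so a script that captures the text with subprocess needs to read stderr rather than stdout.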
[T] http://code.google.com/p/tesseract-ocr/
[V] http://sourceforge.net/projects/vietocr/
[VU] http://vietocr.sourceforge.net/usage.html
2. Tesseract 3.02 Traineddata for Hindi
“Language data for Vietnamese and English is already bundled with the program. Data for other
languages can be downloaded from Tesseract website and should be placed into tessdata
folder.” VietOCR has added support for downloading and installing language data packs.
• The official Hindi traineddata is available from Tesseract's download page at
http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hin.tar.gz&can=2&q=
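If you install the language data by hand, the downloaded archive has to be unpacked and its Hindi files placed into the tessdata folder. The following Python sketch shows one way to do that. It makes no assumption about the folder layout inside the archive: it copies every regular file whose name starts with the language code followed by a dot (for example hin.traineddata). The function name install_language_data is hypothetical.

```python
import shutil
import tarfile
from pathlib import Path
from typing import List


def install_language_data(archive: Path, tessdata_dir: Path,
                          lang: str = "hin") -> List[Path]:
    """Copy all files named '<lang>.*' from a .tar.gz archive into tessdata_dir."""
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name  # drop any folders inside the archive
            if member.isfile() and name.startswith(lang + "."):
                target = tessdata_dir / name
                with tar.extractfile(member) as source, open(target, "wb") as out:
                    shutil.copyfileobj(source, out)
                installed.append(target)
    return installed
```

A call might look like install_language_data(Path("tesseract-ocr-3.02.hin.tar.gz"), Path(r"C:\Program Files (x86)\VietOCR.NET\tessdata")), where the target path is only an example built from the installation directory used in this guide; writing there may require administrator rights.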
4. Install VietOCR.NET
• Unzip VietOCR.NET-3.4.1-Beta3.zip (or newer version) from your downloads
• Run Setup.exe to install the software
• Select the installation directory, e.g. C:\Program Files (x86)\VietOCR.NET
• Complete the installation
7. Run VietOCR.NET
• Click on VIETOCR.exe in the VIETOCR.NET directory, or click on VIETOCR.NET under
All Programs, to start the program
• You should see the following window come up
• Test to verify that the program is working by OCRing an image with English text.
• You can open an image file or copy and paste an image in the program.
• Check that the OCR Language in the dropdown menu on the right says 'English'
• Click on the OCR button to start OCR
• The status bar at the bottom left will show OCR Running and change to OCR completed
when done.
• Clicking the Eraser icon will erase the OCRed text
• Clicking the ABC icon will run spellcheck on the OCRed text
• Right-click on the OCRed text area to bring up a menu with Select All, Cut, Copy etc.
• Click on the various menus and icons to familiarize yourself with the options
• The status bar will show that the image is loading; this may take some time depending on the size of the file
• The menu bar on the left has arrows for page navigation
• The Command menu has options to OCR the current page or to OCR all pages in a TIFF
• Click on OCR page to OCR the current page
• Wait for OCR to complete; large files will take time.
• You can check the output in the files generated in the OUTPUT folder
• Use the command Cancel Bulk OCR to cancel a batch.
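As an alternative to the bulk feature in the GUI, a folder of images can be processed by calling the Tesseract command-line program directly. This is a minimal sketch in Python, not VietOCR's own batch mechanism. It assumes the tesseract executable is on the PATH and that the Hindi data is installed; the function names and folder names are hypothetical. It uses the standard invocation tesseract imagename outputbase -l lang, where Tesseract appends .txt to the output base.

```python
import subprocess
from pathlib import Path
from typing import List

# Image types from the feature list; PDF is left out because this sketch
# passes files straight to the tesseract program.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def batch_commands(input_dir: Path, output_dir: Path,
                   lang: str = "hin") -> List[List[str]]:
    """Build one tesseract command per image file found in input_dir."""
    commands = []
    for image in sorted(input_dir.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            output_base = output_dir / image.stem  # tesseract adds ".txt"
            commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def run_batch(input_dir: Path, output_dir: Path, lang: str = "hin") -> None:
    """Run the commands one after another, stopping at the first failure."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for command in batch_commands(input_dir, output_dir, lang):
        subprocess.run(command, check=True)
```

For example, run_batch(Path("scans"), Path("output")) would write one text file per image into the output folder. Building the commands separately from running them makes it easy to print and inspect them first.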