
How to OCR Hindi text

using VietOCR as the GUI frontend for the Tesseract OCR 3.02 engine

“Tesseract is probably the most accurate open source OCR engine available. It was developed at HP
Labs between 1985 and 1995... and now at Google.” Version 3.01 added support for Hindi and in
version 3.02 Hindi recognition was further improved. [T]
“VietOCR, available in Java and .NET executable, is a GUI frontend for Tesseract OCR engine. [V]
Both versions sport similar graphic user interface and are capable of recognizing text from images
of common formats. The program can also function as a console application, executing from the
command line.
Its features include:
• Java & .NET GUI frontends for Tesseract OCR engine
• Supports all languages provided by Tesseract
• Supports automatic download and installation of language packs
• PDF, TIFF, JPEG, GIF, PNG, BMP image formats
• Paste image from clipboard
• Selection box for Region of Interest (ROI)
• File drag-and-drop
• Bulk & batch operations
• Text replacement postprocessing
• Integrated scanning support
• Spellcheck with Hunspell”
The VietOCR Usage page [VU] gives detailed information about download, installation, and usage.
Please read that page fully to get an overview of the features and functionality.
To OCR images containing Hindi text with VietOCR, follow the instructions below:

1. Download VietOCR
VietOCR is available in two versions, .NET and Java; download the one of your choice. The
Java version requires Java Runtime Environment 6.0 or later (installation instructions). The .NET
version requires Microsoft .NET Framework 2.0 Redistributable.
• The .NET version is available at
http://sourceforge.net/projects/vietocr/files/vietocr.net/
• The Java version is available at
http://sourceforge.net/projects/vietocr/files/vietocr/
Choose the latest version. Currently these are:
• http://sourceforge.net/projects/vietocr/files/vietocr/3.4.3%20Beta/VietOCR-3.4.3-Beta4.zip/download
• http://sourceforge.net/projects/vietocr/files/vietocr.net/3.4.1%20Beta/VietOCR.NET-3.4.1-Beta4.zip/download
[T] http://code.google.com/p/tesseract-ocr/
[V] http://sourceforge.net/projects/vietocr/
[VU] http://vietocr.sourceforge.net/usage.html
2. Tesseract 3.02 Traineddata for Hindi
“Language data for Vietnamese and English is already bundled with the program. Data for other
languages can be downloaded from Tesseract website and should be placed into tessdata
folder.” VietOCR has added support for downloading and installing language data packs.
• The official Hindi traineddata is available from Tesseract's download page at
http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hin.tar.gz&can=2&q=
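If you prefer to place the language data by script rather than by hand, the following is a minimal Python sketch. It assumes the downloaded archive is a gzip-compressed tar file that contains a file named hin.traineddata somewhere inside it. The function name and the paths in the usage note are illustrative only and are not part of VietOCR or Tesseract.

```python
import shutil
import tarfile
from pathlib import Path


def install_traineddata(archive, tessdata_dir, lang="hin"):
    """Copy <lang>.traineddata out of a .tar.gz archive into tessdata_dir.

    Returns the path of the copied file. The tessdata_dir location is an
    assumption; point it at the tessdata folder of your own installation.
    """
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    wanted = f"{lang}.traineddata"
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            # Match on the file name only, because the folder layout
            # inside the archive may vary.
            if member.isfile() and Path(member.name).name == wanted:
                target = tessdata_dir / wanted
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError(f"{wanted} not found in {archive}")
```

For example, install_traineddata("tesseract-ocr-3.02.hin.tar.gz", r"C:\Program Files (x86)\VietOCR.NET\tessdata") would copy the file into the tessdata folder, assuming VietOCR.NET was installed in that directory and you have write permission there.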

3. Hindi Dictionary Files


“Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic)
should be placed in dict folder of VietOCR.
user.dic is an UTF-8-encoded file which contains a list of custom words, one word per line.”
For Hindi, the dictionary files are named hi_IN.dic and hi_IN.aff.
A larger Hindi dictionary is linked from http://raviratlami.blogspot.in/2012/10/blog-post.html. It
can be downloaded from:
• http://goo.gl/IMspZ
• https://skydrive.live.com/#cid=60EACE63E15A752A&id=60EACE63E15A752A%21113
If you already maintain custom word lists, you can add those words to user.dic.
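As a rough illustration of the user.dic format described above (UTF-8, one word per line), here is a small Python sketch that writes such a file from a list of words. The function name is hypothetical, and the choice to drop blank entries and duplicates is an assumption made for tidiness, not a VietOCR requirement.

```python
from pathlib import Path


def write_user_dic(words, dict_dir):
    """Write user.dic (UTF-8, one word per line) into dict_dir and return its path."""
    dict_dir = Path(dict_dir)
    dict_dir.mkdir(parents=True, exist_ok=True)
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        # Skip blank entries and repeated words, keeping the first occurrence.
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    path = dict_dir / "user.dic"
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return path
```

Calling write_user_dic(["भारत", "हिंदी"], "dict") creates dict/user.dic with the two words on separate lines. Note that it overwrites an existing user.dic, so merge your old list into the input first if you want to keep it.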

4. Install VietOCR.net
• Unzip VietOCR.NET-3.4.1-Beta3.zip (or newer version) from your downloads
• Run Setup.exe to install the software
• Select the installation directory, e.g. C:\Program Files (x86)\VietOCR.NET
• Complete the installation

7. Run VietOCR.NET
• Click on VIETOCR.exe in the VIETOCR.NET directory, or click on VIETOCR.NET under All Programs,
to start the program
• You should see the following window come up

• Test to verify that the program is working by OCRing an image with English text.
• You can open an image file or copy and paste an image in the program.
• Check that the OCR Language in the dropdown menu on the right says 'English'
• Click on the OCR button to start OCR

• The status bar at the bottom left will show OCR Running and change to OCR completed when
done.
• Clicking the Eraser icon erases the OCRed text
• Clicking the ABC icon runs spellcheck on the OCRed text
• Right-click on the OCRed text area to bring up a menu with Select All, Cut, Copy, etc.
• Click on the various menus and icons to familiarize yourself with the options

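5. Copy Hindi Language Data and Dictionary
• Copy hin.traineddata from the downloaded language archive into the tessdata folder of VietOCR.NET
• Copy hi_IN.dic and hi_IN.aff into the dict folder of VietOCR.NET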

8. OCR Hindi text – part of an image


• Choose Hindi as the OCR Language in the dropdown menu
• Open an image with Hindi text using file open or copy and paste
• Select a portion of the image using the mouse
• Click on OCR button
• You can rotate, zoom, or fit the image using the icons on the menu bar on the left

9. OCR a multipage tiff


• Open the multipage TIFF file

• Status bar will show loading image – it may take some time depending on size of file
• Menu Bar on left has arrows for page navigation
• Command Menu has option to OCR current page or to OCR all pages in tiff
• Click on OCR page to OCR the current page
• Wait for OCR to complete – large files will take time.

10. OCR a pdf


• VietOCR supports PDF files through Ghostscript.
• It creates working images from the PDF and then runs OCR on them
• It does not let you choose a page range when loading a file, so if you need only a few
pages from a large PDF, make another PDF with just those pages to speed up processing.
• Open the PDF file in VietOCR.NET
• Allow file loading to complete
• The page navigation arrows work the same way as for a multipage TIFF
• The status line above the image shows Page # of ##

• OCR one page or all pages


• Save output.
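One way to make a smaller PDF with just the pages you need is to call Ghostscript directly, since it is already required for PDF support. The Python sketch below builds and runs such a command using Ghostscript's pdfwrite device with the -dFirstPage and -dLastPage options. The function names are illustrative. The executable name "gs" is an assumption: on Windows the console executable is usually gswin64c or gswin32c, so pass that name (or its full path) instead.

```python
import subprocess


def extract_pages_command(src_pdf, dst_pdf, first, last, gs="gs"):
    """Build a Ghostscript command that copies pages first..last into dst_pdf."""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return [
        gs,
        "-sDEVICE=pdfwrite",       # write the output as a PDF
        f"-dFirstPage={first}",    # first page to include (1-based)
        f"-dLastPage={last}",      # last page to include
        "-o", str(dst_pdf),        # output file; also runs in batch mode
        str(src_pdf),
    ]


def extract_pages(src_pdf, dst_pdf, first, last, gs="gs"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(extract_pages_command(src_pdf, dst_pdf, first, last, gs), check=True)
```

For example, extract_pages("book.pdf", "pages_10_to_15.pdf", 10, 15, gs="gswin64c") writes a six-page PDF that loads much faster in VietOCR than the full file. The file names here are placeholders.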

11. Bulk OCR


• Bulk OCR option can be used to OCR a large number of files in a batch mode
• Put the images to be OCRed in a separate folder
• Create a new folder for the OCRed text
• Choose Bulk OCR from Commands menu
• Choose the Bulk OCR (input) directory and the Output directory in the dialog box
• Check the HOCR option if you want the output as HTML pages
• Leave the option unchecked for plain text output
• Click on RUN to start the Bulk OCR process
• A console window will come up showing the progress of the batch.

• You can check the output in the files generated in the Output folder
• Use the Cancel Bulk OCR command to cancel the batch.
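If you would rather script a batch than use the GUI, a comparable result can be obtained by calling the Tesseract command-line program directly. Note that this is not VietOCR's Bulk OCR feature; it is a separate sketch that assumes a tesseract executable is on your PATH and that hin.traineddata is installed in its tessdata folder. The function names and the list of file extensions are illustrative, and which image formats Tesseract can actually read depends on how it was built.

```python
import subprocess
from pathlib import Path

# Extensions treated as images (an assumption; adjust to your files).
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".gif", ".png", ".bmp"}


def bulk_ocr_commands(input_dir, output_dir, lang="hin", hocr=False):
    """Build one 'tesseract image outputbase -l lang [hocr]' command per image."""
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Tesseract appends the file extension to the output base itself.
        output_base = Path(output_dir) / image.stem
        command = ["tesseract", str(image), str(output_base), "-l", lang]
        if hocr:
            command.append("hocr")  # config name that requests hOCR (HTML) output
        commands.append(command)
    return commands


def run_bulk_ocr(input_dir, output_dir, lang="hin", hocr=False):
    """Run the commands one after another, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in bulk_ocr_commands(input_dir, output_dir, lang, hocr):
        subprocess.run(command, check=True)
```

For example, run_bulk_ocr("scans", "ocr_text") processes every image in the scans folder and writes one text file per image into ocr_text. Both folder names are placeholders.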

12. Post Processing


• If you notice any consistent errors in the OCRed output, you can set up a substitution table in
a DangAmbigs.txt file and let VIETOCR correct them in postprocessing.
• For this to work, you have to create a text file called hin.DangAmbigs.txt in the
data directory and add the required substitutions to it.
• You also have to enable the postprocess option under Settings.
• For example, if you notice that को is being OCRed as को, you should add the following
entry to the hin.DangAmbigs.txt file and save the file.
को =को
The space after को ensures that it is changed only when it is at the end of a
word.

• Settings – Options – DangAmbigs.txt


• Browse to the Data subfolder in VIETOCR.NET and choose hin.DangAmbigs.txt
• Check Enable
• Now after you OCR a page, use Command - Postprocess and the substitution will be
applied to the OCRed text.
• Add more substitutions as required.
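To see what a substitution table does to the text, the following Python sketch parses entries of the form wrong=right (as in the example entry above) and applies them as plain string replacements. It only illustrates the idea and is not VietOCR's actual implementation, whose matching rules may differ. The function names are hypothetical. Whitespace is deliberately left untouched, because a trailing space in an entry is significant.

```python
def parse_substitutions(text):
    """Parse 'wrong=right' lines into (wrong, right) pairs, preserving spaces."""
    pairs = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        wrong, right = line.split("=", 1)
        if wrong:
            pairs.append((wrong, right))
    return pairs


def apply_substitutions(text, pairs):
    """Apply each substitution in order, as a plain (non-regex) replacement."""
    for wrong, right in pairs:
        text = text.replace(wrong, right)
    return text
```

With the single entry "teh =the " (note the space before the equals sign and at the end), apply_substitutions("teh cat saw tehran", pairs) changes the standalone word but leaves "tehran" alone, which is the same reason the Hindi entry above includes a space.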
