Tesseract for text recognition

Tesseract is an open source tool for text recognition (OCR).

Installing Tesseract

The installation options depend on your platform. Some Unix systems offer Tesseract as a readily installable package; otherwise you can install a prebuilt binary or build it from source code.

See the Installing Tesseract wiki page for details.

Text recognition for 1 page

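Check the command line help for the full set of options. In its simplest form, you give Tesseract an input image and an output base name, and it writes the recognized text to a file. Note that page layout can affect how well Tesseract recognizes a given page image. The example below uses the Russian language data (-l rus) and writes the result to 1014572_1.txt.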

tesseract -l rus 1014572_1.jpeg 1014572_1

For all files in a directory

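Assume that your input images are in a folder called data, that the tesseract program is on the path, and that a folder called out exists for the output texts. The following command then runs tesseract on each file and stores each result in its own text file. When the command is placed in a batch file instead of typed at the prompt, write %%F in place of %F.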

In Windows

for %F in (data\*.jpeg) do tesseract -l rus "%F" "out\%~nF_tes"

NB! Depending on the page style, it may be useful to adjust the Tesseract command line options.
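In Linux or macOS

The same loop can be sketched for a POSIX shell. This is only an illustration under the same assumptions as above (images in data, results in out, tesseract on the path). The helper names out_base and ocr_all are hypothetical and do not come from Tesseract itself.

```shell
#!/bin/sh
# Hypothetical helper: derive the output base name for one image,
# for example data/1014572_1.jpeg -> out/1014572_1_tes
out_base() {
    name=$(basename "$1" .jpeg)
    printf '%s\n' "out/${name}_tes"
}

# Hypothetical helper: run tesseract on every .jpeg file in data/.
# Tesseract appends .txt to the output base name itself.
ocr_all() {
    mkdir -p out
    for f in data/*.jpeg; do
        [ -e "$f" ] || continue   # skip when the pattern matches no file
        tesseract -l rus "$f" "$(out_base "$f")"
    done
}

# Only start the batch when the input folder is present.
if [ -d data ]; then
    ocr_all
fi
```

Quoting "$f" keeps file names with spaces intact. Extra options, such as the page segmentation mode (--psm in Tesseract 4 and later), can be added to the tesseract call when a page style needs them.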
