Tesseract for text recognition
Tesseract is an open source tool for text recognition (OCR).
Installation options depend on the platform. Some Unix systems offer it as a readily installable package; otherwise you can install a prebuilt binary or build it from source.
Text recognition for one page
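Check the command line help for the full set of options. In its simplest form, you give Tesseract an input image and an output base name, and it writes the recognized text to a file with that name plus the .txt extension. Note that the page layout can affect how well Tesseract recognizes a given page image. The example here reads a Russian-language page (-l rus):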
tesseract -l rus 1014572_1.jpeg 1014572_1
For all files in a directory
When your input images are in a folder called data, the tesseract program is on the path, and a folder called out exists for the output texts, the following Windows command prompt line runs tesseract on each file and stores the result in a text file. Inside a batch file, double the percent signs (%%F and %%~nF).
for %F in (data\*.jpeg) do tesseract -l rus "%F" "out\%~nF_tes"
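On Unix-like systems, the same loop could be sketched in POSIX shell roughly as follows. This is a minimal sketch, not part of the original notes: the function name ocr_dir and the TESSERACT and OCR_LANG variables are hypothetical helpers introduced here for illustration, and the data and out folder names and the rus language are taken from the example above.

```shell
#!/bin/sh
# Sketch: OCR every .jpeg file in one folder into another folder.
# TESSERACT and OCR_LANG are hypothetical override variables (assumptions),
# not settings that Tesseract itself reads.
TESSERACT="${TESSERACT:-tesseract}"
OCR_LANG="${OCR_LANG:-rus}"

ocr_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for img in "$in_dir"/*.jpeg; do
        # Skip the unexpanded pattern when the folder has no matching files.
        [ -e "$img" ] || continue
        base=$(basename "$img" .jpeg)
        # Tesseract appends .txt to the output base name itself.
        "$TESSERACT" -l "$OCR_LANG" "$img" "$out_dir/${base}_tes"
    done
}
```

Calling ocr_dir data out then processes every .jpeg file in data and writes the results to out. The quoting keeps file names with spaces intact.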
NB! Depending on the page style, it may be useful to tweak the Tesseract command line options accordingly, for example the page segmentation mode (the --psm option in recent versions).
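One way to find suitable options for a given page style is to run the same image through several page segmentation modes and compare the resulting text files. The sketch below assumes Tesseract 4 or later, where the option is spelled --psm (older 3.x releases used -psm). The function name try_psm and the TESSERACT variable are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Sketch: run one image through several page segmentation modes.
# TESSERACT is a hypothetical override variable (an assumption).
TESSERACT="${TESSERACT:-tesseract}"

try_psm() {
    img=$1
    shift
    # Strip the file extension to form the output base name.
    base=${img%.*}
    for mode in "$@"; do
        # Each mode gets its own output file, e.g. page_psm6.txt.
        "$TESSERACT" "$img" "${base}_psm${mode}" -l rus --psm "$mode"
    done
}
```

For example, try_psm 1014572_1.jpeg 3 4 6 would produce 1014572_1_psm3.txt, 1014572_1_psm4.txt and 1014572_1_psm6.txt. Mode 3 is fully automatic segmentation, 4 assumes a single column of text, and 6 assumes a single uniform block of text.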