Learning Tesseract OCR as a Data Scientist
I am a graduating Data Science student at Lambda School, now in Labs, the final phase of the program. In this article, I will share my experience learning to use Tesseract OCR for handwriting recognition.
In our labs, we had an opportunity to work on a new app called “Story Squad” for students. Story Squad is the dream of a former teacher, Graig Peterson, to create opportunities for children to have creative writing and drawing time off-screen.
How it works: child users of the website are provided a new chapter in an ongoing story each weekend. They read the story and then follow both writing and drawing prompts to spend an hour off-screen writing and drawing. When they’re done, they upload photos of each, and this is where our data science team comes in. The stories are transcribed into text, analyzed for complexity, screened for inappropriate content, and then sent to a moderator. Once all submissions have been checked over on moderation day, our clustering algorithm groups the submissions by similar complexity and creates Squads of 4 to head into a game of assigning points and voting for the best submissions in head-to-head pairings within the cluster! Then it starts all over again the following weekend.
We collaborated with the Web and iOS teams. On our Data Science team, the problem we are trying to solve is improving the transcription of the uploaded stories using Tesseract OCR.
What is Tesseract?
Tesseract is an open-source text recognition (OCR) engine. It has been one of the more powerful OCR options available for quite some time. It can be used directly from the command line or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages. Tesseract does not have a built-in GUI. The engine can be used from many programming languages and frameworks through wrappers. It can be used with its existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize the text from an image of a single text line.
By the way, what is OCR?
OCR stands for Optical Character Recognition. In other words, OCR systems transform a two-dimensional image of text, which may be machine-printed or handwritten, from its image representation into machine-readable text.
Going back to Tesseract
For Tesseract OCR to obtain reasonable results, you will want to supply images that are cleanly pre-processed.
When utilizing Tesseract, I recommend:
- Using an input image with as high a resolution and DPI as possible.
- Applying thresholding to segment the text from the background.
- Ensuring the foreground is as clearly segmented from the background as possible (i.e., no pixelation or character deformation).
- Applying text skew correction to the input image to ensure the text is properly aligned.
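To make the thresholding recommendation concrete, here is a minimal sketch. It uses Otsu's method, a common global thresholding technique that I picked purely for illustration (it is not necessarily what Tesseract does internally), and it works on a plain list of 8-bit grayscale values so that it runs without extra libraries. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 8-bit grayscale values."""
    # Build a 256-bin histogram of intensities.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        if weight_dark == 0:
            continue  # no pixels at or below t yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A global threshold like this tends to work on evenly lit scans. Photos with shadows usually need a local method instead.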
Let’s begin the local installation
Tesseract and Leptonica: You will need a recent version (≥4.0.0beta1) of Tesseract built with the training tools and matching Leptonica bindings. More information can be found in the Tesseract project wiki.
Alternatively, you can build Leptonica and Tesseract within the tesstrain project and install them to a subdirectory ./usr in the repo with the command make leptonica tesseract.
Tesseract will be built from the git repository, which requires CMake, autotools (including autotools-archive), and some additional libraries for the training tools.
Python: You need a recent version of Python 3.x. For image processing, the Python library Pillow is used. If you don't have a global installation, please use the provided requirements file: pip install -r requirements.txt
Then you can choose a name for your model: By convention, Tesseract stack models including language-specific resources use (lowercase) three-letter codes defined in ISO 639 with additional information separated by an underscore, e.g., chi_tra_vert for traditional Chinese with vertical typesetting. Language-independent (i.e., script-specific) models use the capitalized name of the script type as an identifier, e.g., Hangul_vert for Hangul script with vertical typesetting. In the following, the model name is referenced by MODEL_NAME.
Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
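To illustrate what a ratio-based split means, here is a minimal sketch. The function name, the shuffling, and the seed are my own assumptions for illustration; how tesstrain performs the split internally is not shown here.

```python
import random


def split_ground_truth(files, ratio_train, seed=0):
    """Shuffle the file list reproducibly and split it into (training, evaluation)."""
    if not 0.0 < ratio_train < 1.0:
        raise ValueError("ratio_train must be strictly between 0 and 1")
    shuffled = sorted(files)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratio_train)
    return shuffled[:n_train], shuffled[n_train:]
```

With ten files and a ratio of 0.9, for example, nine files end up in the training list and one in the evaluation list.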
Data
Data (JPG images of handwritten stories) is gathered into a newly created update/data/ directory in tesstrain (avoid pushing any image data to GitHub; we suggest dividing the images among teammates for editing).
Images must be TIFF and have the extension .tif, or PNG and have the extension .png, .bin.png or .nrm.png.
Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.
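The naming rule above is easy to get wrong by hand, so a small check can help before training. This is a sketch with hypothetical helper names; it only encodes the extensions and the single-line rule stated above.

```python
from pathlib import Path

# Accepted line-image extensions, longest first so that ".bin.png"
# is matched before the plain ".png".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_path(image_path):
    """Return the .gt.txt path that belongs to a line image."""
    image_path = Path(image_path)
    for suffix in IMAGE_SUFFIXES:
        if image_path.name.endswith(suffix):
            stem = image_path.name[: -len(suffix)]
            return image_path.with_name(stem + ".gt.txt")
    raise ValueError(f"unsupported image extension: {image_path.name}")


def is_single_line(text):
    """True if the transcription is one non-empty line (a trailing newline is tolerated)."""
    stripped = text.rstrip("\r\n")
    return bool(stripped.strip()) and "\n" not in stripped and "\r" not in stripped
```

Looping over the images in the ground truth folder and checking that transcription_path(image).exists() catches missing or misnamed files early.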
The repository contains a ZIP archive with sample ground truth; extract it to ./data/foo-ground-truth and run make training.
Data Preparation: The bulk of the work done with Tesseract is in cleaning and preparing the data before training:
- Conversion: All images must be converted to PNG format.
- Processing: Every image is run through a Python script that currently uses scikit-image (specifically a Sauvola threshold) to bring the image to a standard that the computer (and Tesseract) can best interpret. Some images have to be dealt with manually for quality improvement.
- Segmentation: Tesseract best learns, understands and transcribes from small chunks of text (especially when it’s handwriting). Once every image is converted and processed, it is segmented by Tesseract into an average range of 20–40 segments (lines of text essentially).
- Ground Truth: During segmentation, Tesseract makes a transcription on the segment and outputs the result to a respective text file. This text file is a ground truth file that will be used in the training process of the new model.
Every ground truth file contains a mostly inaccurate transcription which needs to be manually corrected to represent the text that is in its respective image segment.
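The processing step can be sketched roughly as follows. This is not our actual script, only a minimal illustration of a Sauvola threshold with scikit-image; the function name and the window_size and k values are assumptions chosen for the example, not tuned settings.

```python
import numpy as np
from skimage.filters import threshold_sauvola


def binarize_sauvola(gray, window_size=25, k=0.2):
    """Binarize a 2-D grayscale array with a Sauvola local threshold.

    Returns a uint8 image in which paper is 255 and ink is 0.
    window_size must be odd; both parameters are illustrative values.
    """
    gray = np.asarray(gray)
    # One threshold per pixel, computed from the mean and standard
    # deviation of the window around that pixel.
    local_threshold = threshold_sauvola(gray, window_size=window_size, k=k)
    paper = gray > local_threshold
    return np.where(paper, 255, 0).astype(np.uint8)
```

Because the threshold is computed per pixel, this approach copes better with shadows and uneven lighting in phone photos than a single global cutoff would.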
How to clean an image:
How to edit boxfile:
Training
Once data preparation has been completed, training can begin. It is an iterative, two-step process:
- The tesstrain directory is specifically designed for this task through the use of Make. Essentially, a long, shell-script-like program builds and trains a new Tesseract language for you using your image/ground truth file pairs.
- This repository also allows training to be fairly customizable. Currently, we are testing the outcomes of different settings, such as the page segmentation mode (the way in which Tesseract interprets a given image during training).
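To give a feel for page segmentation modes, here is a small sketch that runs the Tesseract command-line tool with a chosen mode. It shows the modes at recognition time only; how the mode is passed during training depends on the tesstrain setup and is not shown. The helper names are hypothetical, and the sketch assumes the tesseract executable (version 4 or later) is on the PATH.

```python
import subprocess

# A few of Tesseract's page segmentation modes (see `tesseract --help-psm`).
PSM_AUTO = 3          # fully automatic page segmentation (the default)
PSM_SINGLE_BLOCK = 6  # assume a single uniform block of text
PSM_SINGLE_LINE = 7   # treat the image as a single text line


def tesseract_command(image_path, psm, lang="eng"):
    """Build the argument list for one Tesseract run that prints to stdout."""
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def transcribe(image_path, psm, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout
```

For a segmented line image, transcribe("line.png", PSM_SINGLE_LINE) would be the natural choice, while a full page photo is better served by PSM_AUTO or PSM_SINGLE_BLOCK.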
make training MODEL_NAME=name-of-the-resulting-model
This is a shortcut for make unicharset lists proto-model training.
Run make help to see all the possible targets and variables.
Step 1: Box File Editing
- The build automation provided by Make constructs all of the necessary files to begin training. However, a very important file is created that needs to be manually corrected (similar to the ground truth files from above).
- Box files are essentially text files that contain letters (pulled from the ground truth files) and the coordinates of the box surrounding each letter, as recognized by the neural network under the hood.
These files must be manually corrected to place the boxes over the characters at the ideal widths and heights used for training. This process can be done with QTBoxEditor or jTessBoxEditor.
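For readers who want to inspect box files programmatically, here is a minimal parsing sketch. It assumes the classic six-field Tesseract box format (symbol, left, bottom, right, top, page number, with the origin at the bottom-left corner of the image); the files generated in your setup may differ, and the names Box, parse_box_line, and format_box_line are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line of the form 'symbol left bottom right top page'."""
    # Split from the right so that a symbol that is itself a space survives.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box back into a box-file line (without the trailing newline)."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"
```

Since Box is a named tuple, a correction such as box._replace(right=box.right + 2) yields an adjusted copy that can be written back with format_box_line.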
Step 2: Update the Current Model
- A new model is created at the beginning of training but is based on the incorrect box files. The box files have to be edited before the model can be updated (from the last checkpoint) on the correctly represented boxes.
- Before updating, all .lstmf files must be removed so they can be replaced during re-training to contain the edited box files.
- The ssq.sh script can be run again to retrain but <model_name>_checkpoint must be provided as the model to continue training from (instead of eng).
These processes have been mostly automated by using a customized shell script we have created that runs each of these programs based on the provided flags/user input. Running ./ssq.sh --help will explain how to use the script.
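The cleanup before re-training (removing the stale .lstmf files) can also be sketched in a few lines of Python. This is not the ssq.sh script, which is not reproduced in this article; the function name and the dry_run flag are hypothetical, and where the .lstmf files live in your checkout is an assumption you should verify first.

```python
from pathlib import Path


def remove_lstmf_files(ground_truth_dir, dry_run=True):
    """Delete (or, with dry_run=True, only list) the .lstmf files under a directory."""
    stale = sorted(Path(ground_truth_dir).rglob("*.lstmf"))
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running it once with the default dry_run=True shows what would be deleted, which is a cheap safeguard before removing files from a folder that also holds the hand-corrected box and ground truth files.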
Conclusion
Although the results are promising, the main challenge is that manually cleaning the images and entering the corrected text is a lot of work. Regardless of the work involved now, this technology is here to stay, and it will continue to be updated and to grow. The best thing about Tesseract is that it is free and easy to use. It is a command-line OCR engine, but using it is simplified significantly by a Python wrapper called pytesseract. We noticed that Tesseract’s built-in image processing is limited. To get the most out of it, you need to use an image pre-processor or supply an image that has already been processed.