
Processing chain


The processing chain is divided into four stages:

  1. Image processing ............... DONE
  2. Text recognition ............... DONE
  3. Text processing ................ ONGOING
  4. Data population ................ ONGOING

Image processing

  • Detect images to analyze based on the probabilistic Hough lines approach (e.g., page 28 of document 1922 contains 371 lines, so it is flagged for analysis; see the sketch after this list).

  • Remove table edges, including:

    • remove noise with a Gaussian blur,
    • convert the image to grayscale,
    • binarize,
    • correct the skew angle:
      • Standard approach: estimate the skew angle, create a rotation matrix and apply an affine transform
      • Advanced approach: 2D FFT transform
    • create a mask of horizontal and vertical lines and remove those lines
  • Segment image

    • Method 1 - segment and find block contours:

      [image: block segmentation]

    • Method 2 - segment and find line contours (see the sketch below):

      [image: line segmentation, original]

      Extract lines from each block:

      [image: line segmentation, original]

      Clean "artifacts" to increase text recognition accuracy:

      [image: line segmentation, without artifacts]
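
A minimal sketch of the triage and table-edge removal steps above, using OpenCV. The thresholds and kernel sizes are illustrative assumptions, not values taken from this project:

```python
import cv2
import numpy as np

def should_analyze(gray, min_lines=300):
    # Probabilistic Hough transform as a cheap table detector: pages with
    # many straight lines (e.g., 371 on p28 of document 1922) get analyzed.
    # The 300-line cutoff is an assumption for illustration.
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=100, maxLineGap=10)
    return lines is not None and len(lines) >= min_lines

def remove_table_edges(img):
    # Denoise, grayscale and binarize (white foreground on black).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, thresh = cv2.threshold(blur, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Standard deskew: estimate the angle from the minimum-area rectangle
    # around the foreground pixels, then apply an affine rotation.
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = thresh.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    thresh = cv2.warpAffine(thresh, M, (w, h), flags=cv2.INTER_CUBIC,
                            borderMode=cv2.BORDER_REPLICATE)

    # Morphological opening with long, thin kernels isolates the horizontal
    # and vertical rules; subtracting their mask erases the table grid.
    horiz = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vert = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
    return cv2.subtract(thresh, cv2.add(horiz, vert))
```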

Both methods give similar results (not published for now). Line segmentation makes it possible to reprocess strings where text recognition fails to capture some characters.
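
A contour-based sketch of Method 2, continuing from the preprocessing sketch above (the kernel size and area threshold are assumptions): dilating the cleaned image merges the characters of a line into a single blob whose bounding box can then be cropped.

```python
import cv2

def segment_lines(clean, min_area=500):
    # A wide, short kernel merges characters horizontally so that each
    # text line becomes one connected blob (kernel size is an assumption).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blobs = cv2.dilate(clean, kernel, iterations=3)
    # OpenCV 4.x signature: findContours returns (contours, hierarchy).
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) > min_area]
    # Sort top-to-bottom so lines come back in reading order.
    return [clean[y:y + h, x:x + w]
            for x, y, w, h in sorted(boxes, key=lambda b: b[1])]
```

Swapping in a taller kernel merges whole paragraphs instead of single lines, which gives the block-level segmentation of Method 1.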

Text recognition

Different OCR software candidates were considered, including:

  • Tesseract: a free and open-source OCR engine; it offers different segmentation and engine modes and works offline. Tesseract includes a neural network subsystem configured as a textline recognizer. Extensive documentation on tuning Tesseract is available here, here for custom tips, or there to learn how to train Tesseract on specific characters.
  • Google Vision API (Document Text Detection): similar to Tesseract, and the Google Cloud environment is a strong added value for production. Google Vision OCR offers a "document text detection" mode optimized for text recognition on printed documents. However, it is a paid service, tuning options are limited, and it is black-box software.
  • Amazon Textract: does an awful job on segmentation and character recognition; additional feedback would be appreciated.
  • Microsoft Azure Cognitive Services: similar to Amazon Textract in terms of awfulness; additional feedback would be appreciated here too.
  • Tabula: does a great job extracting data from table forms, but only applies to PDFs that contain embedded text (i.e., not scanned documents).
  • ABBYY FlexiCapture: great, but paid and black-box software.

In the end, Tesseract was chosen. Tesseract extracts blocks (or lines) from the processed image and converts them into text.
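
With pytesseract (a common Python wrapper for Tesseract; the wiki does not name the binding actually used), the per-line call reduces to something like:

```python
import pytesseract

# --oem 0 requires the legacy eng.traineddata files to be installed.
def recognize_line(line_img, config="--oem 0 --psm 6"):
    return pytesseract.image_to_string(line_img, lang="eng", config=config)
```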

For Method 2 (line segmentation), if the default settings fail to capture characters, the algorithm reruns text recognition under different page segmentation (PSM) and OCR engine (OEM) settings. For debugging, the recognition settings are logged per page, block and line, together with the OEM and PSM modes used:

> ℹ : p28 b0 r4 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r8 kept after first processing
> ℹ : p28 b0 r9 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r10 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r11 kept after first processing

p28 stands for page 28, b0 for block 0 out of N blocks, and r4 for line 4. The default Tesseract settings are --oem 0 --psm 6 (legacy mode). Compared with the more recent LSTM neural network mode (--oem 1), the legacy mode (--oem 0) is based on recognizing character patterns. Since table forms contain repetitive patterns, this motivates choosing the legacy OCR engine mode. A sketch of the retry logic follows.
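
A sketch of that retry loop, reusing recognize_line from the sketch above. The completeness check is a hypothetical placeholder, since the wiki does not spell out how a failed capture is detected:

```python
EXPECTED_TOKENS = 24  # assumed: one value per column of the table row

def looks_complete(text, expected=EXPECTED_TOKENS):
    # Hypothetical heuristic: keep a line if it yields enough tokens.
    return len(text.split()) >= expected

def recognize_with_fallback(line_img, page, block, row):
    text = recognize_line(line_img, config="--oem 0 --psm 6")
    if looks_complete(text):
        print(f"ℹ : p{page} b{block} r{row} kept after first processing")
        return text
    print(f"ℹ : p{page} b{block} r{row} reprocessed under oem 1 psm 11")
    return recognize_line(line_img, config="--oem 1 --psm 11")
```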

Text processing

  • Postprocess raw strings from Tesseract with RegEx functions, including:
    • split numerical values (e.g., 987.6) and alpha characters (e.g., location or variable),

    • decode control characters (e.g., "\n" to newline)

    • For numerical values (0-9 and punctuation):

      • remove noisy characters ({, $, £, etc.)
      • correct digits (e.g., a "4" misread for a "+" sign) and characters ("S" to "5", "O" to "0", "B" to "8", etc.)
      • remove blanks
      • correct characters standing in for the decimal point (e.g., "987-6" to "987.6")
      • infer the variable in a line or block from the occurrence of numerical values and the string's length,
      • correct variables accordingly (a sketch of this cleanup follows the list)
      Input:
      7\n\n002°5 996-4 017:1 008-4 984-4 978-2 006:3 0023 018-8 017°0 0152 0127 009:3 992‘C 009-8 009'1 028-4 025'9 0272 025'3 999'4 981-9 021-8 012-3
      
      Output:
      7 002.5 996.4 017.1 008.4 984.4 978.2 006.3 002.3 018.8 017.0 015.2 012.7 009.3 992.6 009.8 009.1 028.4 025.9 027.2 025.3 999.4 981.9 021.8 012.3
      
    • For alpha characters (A-Za-z):

      • correct alpha characters using a predefined dictionary and the Levenshtein distance metric (see the sketch after this list)
      • format characters (e.g., align legends, include meta information, etc.)
      • the last 1%, or how to get to an (almost) perfect output structure
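
A sketch of the numeric cleanup, assuming a small confusion map built from the substitutions listed above (the map is illustrative, not the project's full table):

```python
import re

# Illustrative OCR confusion map (see the substitutions listed above).
CHAR_MAP = str.maketrans({"S": "5", "O": "0", "B": "8", "C": "6"})
NOISE = re.compile(r"[{}$£]")

def clean_numeric(token):
    token = NOISE.sub("", token).translate(CHAR_MAP).replace(" ", "")
    # OCR often renders the decimal point as -, :, ° or a quote,
    # e.g. "987-6" or "017:1"; restore it between two digits.
    return re.sub(r"(?<=\d)[-:°'‘’](?=\d)", ".", token)
```

For example, clean_numeric("992‘C") returns "992.6", matching the input/output pair shown above.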

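For the alpha tokens, a minimal Levenshtein correction against a predefined dictionary might look like this (the vocabulary entries are hypothetical):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

VOCAB = {"TEMPERATURE", "PRESSURE", "WIND"}  # hypothetical dictionary

def correct_word(word, vocab=VOCAB, max_dist=2):
    # Snap a token to its nearest dictionary entry if it is close enough.
    best = min(vocab, key=lambda v: levenshtein(word.upper(), v))
    return best if levenshtein(word.upper(), best) <= max_dist else word
```
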
Data population

  • Export the unstructured data into tabular formats (CSV, JSON); a sketch follows this list.
  • Celebrate
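
A sketch of the export step using the standard library; the row schema is an assumption, not the project's actual output format:

```python
import csv
import json

def export(rows, stem):
    # rows: a list of dicts, e.g. {"page": 28, "block": 0, "line": 4,
    # "variable": "PRESSURE", "value": 992.6}; the schema is illustrative.
    with open(f"{stem}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    with open(f"{stem}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```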

