
Processing chain


The processing chain is divided into four stages:

  1. Image processing ............... DONE
  2. Text recognition ............... DONE
  3. Text processing ................ ONGOING
  4. Data population ................ ONGOING

Image processing

  • Detect images to analyze based on the probabilistic Hough lines approach (e.g., page 28 of document 1922 contains 371 lines, so it is flagged for analysis; see the sketch after this list).

  • Remove table edges, including:

    • remove noise with a Gaussian blur,
    • convert the image to grayscale,
    • binarize,
    • correct the skew angle:
      • Standard approach: estimate the skew angle, create a rotation matrix and apply an affine transform
      • Advanced approach: 2D FFT transform
    • create a mask of horizontal and vertical lines and remove those lines
  • Segment image

    • Method 1 - segment and find block contours:

      [image: block segmentation]

    • Method 2 - segment and find line contours (see the sketch below):

      [image: line segmentation, original]

      Extract lines from each block:

      [image: line segmentation, original]

      Clean "artifacts" to increase text recognition accuracy:

      [image: line segmentation, without artifacts]
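
A minimal sketch of the triage and table-edge removal steps above, using OpenCV. The thresholds and kernel sizes are illustrative assumptions, not values taken from this project:

```python
import cv2
import numpy as np

def should_analyze(gray, min_lines=300):
    # Probabilistic Hough transform as a cheap table detector: pages with
    # many straight lines (e.g., 371 on p28 of document 1922) get analyzed.
    # The 300-line cutoff is an assumption for illustration.
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=100, maxLineGap=10)
    return lines is not None and len(lines) >= min_lines

def remove_table_edges(img):
    # Denoise, grayscale and binarize (white foreground on black).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, thresh = cv2.threshold(blur, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Standard deskew: estimate the angle from the minimum-area rectangle
    # around the foreground pixels, then apply an affine rotation.
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = thresh.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    thresh = cv2.warpAffine(thresh, M, (w, h), flags=cv2.INTER_CUBIC,
                            borderMode=cv2.BORDER_REPLICATE)

    # Morphological opening with long, thin kernels isolates the horizontal
    # and vertical rules; subtracting their mask erases the table grid.
    horiz = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vert = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
    return cv2.subtract(thresh, cv2.add(horiz, vert))
```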

Both methods give similar results (not published for now). Line segmentation makes it possible to reprocess strings where text recognition fails to capture some characters.
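
A contour-based sketch of Method 2, continuing from the preprocessing sketch above (the kernel size and area threshold are assumptions): dilating the cleaned image merges the characters of a line into a single blob whose bounding box can then be cropped.

```python
import cv2

def segment_lines(clean, min_area=500):
    # A wide, short kernel merges characters horizontally so that each
    # text line becomes one connected blob (kernel size is an assumption).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blobs = cv2.dilate(clean, kernel, iterations=3)
    # OpenCV 4.x signature: findContours returns (contours, hierarchy).
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) > min_area]
    # Sort top-to-bottom so lines come back in reading order.
    return [clean[y:y + h, x:x + w]
            for x, y, w, h in sorted(boxes, key=lambda b: b[1])]
```

Swapping in a taller kernel merges whole paragraphs instead of single lines, which gives the block-level segmentation of Method 1.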

Text recognition

Different OCR software candidates were considered, including:

  • Tesseract: a free and open-source OCR engine; it offers different segmentation and engine modes and works offline. Tesseract includes a neural network subsystem configured as a textline recognizer. Extensive documentation on tuning Tesseract is available here, here for custom tips, or there to learn how to train Tesseract on specific characters.
  • Google Vision API (Document Text Detection): similar to Tesseract, and the Google Cloud environment is a strong added value for production. Google Vision OCR offers a "document text detection" mode optimized for text recognition on printed documents. However, it is a paid service, tuning options are limited, and it is black-box software.
  • Amazon Textract: does an awful job on segmentation and character recognition; additional feedback would be appreciated.
  • Microsoft Azure Cognitive Services: similar to Amazon Textract in terms of awfulness; additional feedback would be appreciated here too.
  • Tabula: does a great job extracting data from table forms, but only applies to PDFs that contain embedded text (i.e., not scanned documents).
  • ABBYY FlexiCapture: great, but paid and black-box software.

In the end, Tesseract was chosen. Tesseract extracts blocks (or lines) from the processed image and converts them into text.
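
With pytesseract (a common Python wrapper for Tesseract; the wiki does not name the binding actually used), the per-line call reduces to something like:

```python
import pytesseract

# --oem 0 requires the legacy eng.traineddata files to be installed.
def recognize_line(line_img, config="--oem 0 --psm 6"):
    return pytesseract.image_to_string(line_img, lang="eng", config=config)
```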

For Method 2 (line segmentation), if the default settings fail to capture characters, the algorithm reruns text recognition under different page segmentation (PSM) and OCR engine (OEM) settings. For debugging, the recognition settings are logged per page, block and line, together with the OEM and PSM modes used:

> ℹ : p28 b0 r4 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r8 kept after first processing
> ℹ : p28 b0 r9 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r10 reprocessed under oem 1 psm 11
> ℹ : p28 b0 r11 kept after first processing

p28 stands for page 28, b0 for block 0 out of N blocks, and r4 for line 4. The default Tesseract settings are --oem 0 --psm 6 (legacy mode). Compared with the more recent LSTM neural network mode (--oem 1), the legacy mode (--oem 0) is based on recognizing character patterns. Since table forms contain repetitive patterns, this motivates choosing the legacy OCR engine mode. A sketch of the retry logic follows.
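
A sketch of that retry loop, reusing recognize_line from the sketch above. The completeness check is a hypothetical placeholder, since the wiki does not spell out how a failed capture is detected:

```python
EXPECTED_TOKENS = 24  # assumed: one value per column of the table row

def looks_complete(text, expected=EXPECTED_TOKENS):
    # Hypothetical heuristic: keep a line if it yields enough tokens.
    return len(text.split()) >= expected

def recognize_with_fallback(line_img, page, block, row):
    text = recognize_line(line_img, config="--oem 0 --psm 6")
    if looks_complete(text):
        print(f"ℹ : p{page} b{block} r{row} kept after first processing")
        return text
    print(f"ℹ : p{page} b{block} r{row} reprocessed under oem 1 psm 11")
    return recognize_line(line_img, config="--oem 1 --psm 11")
```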

Text processing

  • Postprocess raw strings from Tesseract with RegEx functions, including:
    • split numerical values (e.g., 987.6) and alpha characters (e.g., location or variable),

    • decode control characters (e.g., "\n" to newline)

    • For numerical values (0-9 and punctuation):

      • remove noisy characters ({, $, £, etc.)
      • correct digits (e.g., a "4" misread for a "+" sign) and characters ("S" to "5", "O" to "0", "B" to "8", etc.)
      • remove blanks
      • correct characters standing in for the decimal point (e.g., "987-6" to "987.6")
      • infer the variable in a line or block from the occurrence of numerical values and the string's length,
      • correct variables accordingly (a sketch of this cleanup follows the list)
      Input:
      7\n\n002°5 996-4 017:1 008-4 984-4 978-2 006:3 0023 018-8 017°0 0152 0127 009:3 992‘C 009-8 009'1 028-4 025'9 0272 025'3 999'4 981-9 021-8 012-3
      
      Output:
      7 002.5 996.4 017.1 008.4 984.4 978.2 006.3 002.3 018.8 017.0 015.2 012.7 009.3 992.6 009.8 009.1 028.4 025.9 027.2 025.3 999.4 981.9 021.8 012.3
      
    • For alpha characters (A-Za-z):

      • correct alpha characters using a predefined dictionary and the Levenshtein distance metric (see the sketch after this list)
      • format characters (e.g., align legends, include meta information, etc.)
      • the last 1%, or how to get to an (almost) perfect output structure
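
A sketch of the numeric cleanup, assuming a small confusion map built from the substitutions listed above (the map is illustrative, not the project's full table):

```python
import re

# Illustrative OCR confusion map (see the substitutions listed above).
CHAR_MAP = str.maketrans({"S": "5", "O": "0", "B": "8", "C": "6"})
NOISE = re.compile(r"[{}$£]")

def clean_numeric(token):
    token = NOISE.sub("", token).translate(CHAR_MAP).replace(" ", "")
    # OCR often renders the decimal point as -, :, ° or a quote,
    # e.g. "987-6" or "017:1"; restore it between two digits.
    return re.sub(r"(?<=\d)[-:°'‘’](?=\d)", ".", token)
```

For example, clean_numeric("992‘C") returns "992.6", matching the input/output pair shown above.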

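For the alpha tokens, a minimal Levenshtein correction against a predefined dictionary might look like this (the vocabulary entries are hypothetical):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

VOCAB = {"TEMPERATURE", "PRESSURE", "WIND"}  # hypothetical dictionary

def correct_word(word, vocab=VOCAB, max_dist=2):
    # Snap a token to its nearest dictionary entry if it is close enough.
    best = min(vocab, key=lambda v: levenshtein(word.upper(), v))
    return best if levenshtein(word.upper(), best) <= max_dist else word
```
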
Data population

  • Export the unstructured data into tabular formats (CSV, JSON); a sketch follows this list.
  • Celebrate
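
A sketch of the export step using the standard library; the row schema is an assumption, not the project's actual output format:

```python
import csv
import json

def export(rows, stem):
    # rows: a list of dicts, e.g. {"page": 28, "block": 0, "line": 4,
    # "variable": "PRESSURE", "value": 992.6}; the schema is illustrative.
    with open(f"{stem}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    with open(f"{stem}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```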

