Google OCR and Document AI

Document AI

A review of the OCR alternatives available from Google and the differences between Vision API OCR and Document AI OCR

Author

Rafa Sanchez

Published

August 28, 2022

History and OCR Technology

Achieving very high accuracy in OCR is challenging because of many factors. Back in 2006, Google started sponsoring Tesseract, an OCR engine originally developed as proprietary software at Hewlett-Packard labs and released as open source in 2005. Here is an overview paper of Tesseract OCR from Google from 2006.

Tesseract has evolved considerably in recent years, with major revisions such as version 4, which added an LSTM-based engine and models for many additional languages (up to 116 languages), and version 5. It is now an open source project, independent from Google AI.

This paper compares AWS Textract, Tesseract and Google OCR, and ranks Google's proprietary OCR first.

Google's current Document AI technology uses self-attention mechanisms, as described in this blog post.

OCR services in Google

Google offers OCR through the following services:

Vision API DOCUMENT_TEXT_DETECTION: performs OCR on dense text images, such as documents (PDF/TIFF), and images with handwriting. TEXT_DETECTION can be used for sparse text images. See here.
Vision API TEXT_DETECTION: performs OCR on text within the image. Text detection is optimized for areas of sparse text within a larger image. If the image is a document (PDF/TIFF), has dense text, or contains handwriting, use DOCUMENT_TEXT_DETECTION instead. See here.
Document AI OCR Processor: performs OCR supporting multiple formats and resolutions, including quality information (as of v1.1).
ML Kit text recognition: for Android, iOS and Web development, even offline
Gmail OCR: extracting text from Gmail attachments
Google Drive OCR: https://support.google.com/drive/answer/176692?hl=en
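
To make the choice between the two Vision API features concrete, here is a minimal sketch of the JSON body sent to the `images:annotate` REST method. The helper name `build_annotate_request` and the `dense_text` flag are hypothetical, introduced only for illustration. The code only builds the payload; authenticating and sending the request are left out.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes, dense_text: bool) -> dict:
    """Build a body for POST https://vision.googleapis.com/v1/images:annotate.

    dense_text=True selects DOCUMENT_TEXT_DETECTION (documents, handwriting);
    dense_text=False selects TEXT_DETECTION (sparse text inside a larger image).
    """
    feature = "DOCUMENT_TEXT_DETECTION" if dense_text else "TEXT_DETECTION"
    return {
        "requests": [
            {
                # Inline image content must be base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes; a real call would read an image file instead.
    body = build_annotate_request(b"placeholder image bytes", dense_text=True)
    print(json.dumps(body, indent=2))
```

The only difference between the two requests is the value of `features[].type`, so switching between sparse and dense text detection does not change the rest of the client code.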

Differences between Vision API OCR and Document AI OCR

The underlying models are the same in Vision API and Document AI, although Document AI can provide additional information about the document structure.

DOCUMENT_TEXT_DETECTION, TEXT_DETECTION and Document AI all support both online (synchronous) requests and offline (asynchronous) requests for processing large batches of files (PDF/TIFF).
On pricing, the cost per 1000 documents is the same for Document AI OCR and for Vision API DOCUMENT_TEXT_DETECTION and TEXT_DETECTION. However, Vision API gives the first 1000 units/month for free, whereas Document AI does not.
Vision API can accept PDF/TIFF files up to 2000 pages per batch request, but Document AI OCR can only accept up to 500 pages per batch request.
Document AI delivers multiple parser versions: v1.0, v1.1 and v1.2. v1.1 brings quality parameters without extra cost. v1.2 removes the quality parameters, but adds other capabilities.
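
The page limits above affect how many batch requests a large job needs. The following is a rough sketch of that arithmetic. The constants simply restate the limits quoted in this post (they may have changed since), the function name `batch_requests_needed` is hypothetical, and the calculation assumes pages can be distributed freely across requests.

```python
# Per-request page limits as quoted in this post; verify against current quotas.
VISION_API_MAX_PAGES = 2000
DOCUMENT_AI_MAX_PAGES = 500


def batch_requests_needed(total_pages: int, max_pages_per_request: int) -> int:
    """Return how many batch requests are needed to cover total_pages."""
    if total_pages < 0 or max_pages_per_request <= 0:
        raise ValueError("page counts must be non-negative and the limit positive")
    # Ceiling division without floating point.
    return -(-total_pages // max_pages_per_request)


if __name__ == "__main__":
    pages = 4500
    print("Vision API requests:", batch_requests_needed(pages, VISION_API_MAX_PAGES))
    print("Document AI requests:", batch_requests_needed(pages, DOCUMENT_AI_MAX_PAGES))
```

For a 4500-page job, this gives 3 requests with the Vision API limit and 9 with the Document AI limit.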

hOCR

Google OCR products do not support the hOCR format today. hOCR is an open standard for representing OCR output.

You or a partner can write a converter from the Document Proto if necessary. The Document Proto, which contains all the parser information, is described in the REST API documentation here. Converting to hOCR requires a specific conversion tool, like this public one available for Vision API OCR to hOCR, but adapted to Document AI output.

There is also a package to process hOCR.