Google OCR and Document AI
History and OCR Technology
Achieving very high OCR accuracy is challenging because many factors affect the result. Tesseract started as proprietary software at Hewlett-Packard Labs; it was later released as open source, and Google has sponsored its development since 2006. Here is an overview paper of Tesseract OCR from Google from 2006.
Tesseract has evolved considerably in recent years through major revisions: version 4 added an LSTM-based engine and models for many additional languages (up to 116 languages), and it was followed by version 5. It’s now an independent and open source project from Google AI.
This paper compares AWS Textract, Tesseract and Google OCR, and ranks Google's proprietary OCR first.
The current Document AI technology of Google uses self-attention mechanisms as described in this blog post.
OCR services in Google
Services offering OCR in Google are the following:
- Vision API DOCUMENT_TEXT_DETECTION: performs OCR on dense text images, such as documents (PDF/TIFF), and images with handwriting. TEXT_DETECTION can be used for sparse text images. See here.
- Vision API TEXT_DETECTION: performs OCR on text within the image. Text detection is optimized for areas of sparse text within a larger image. If the image is a document (PDF/TIFF), has dense text, or contains handwriting, use DOCUMENT_TEXT_DETECTION instead. See here.
- Document AI OCR Processor: performs OCR supporting multiple formats and resolutions, including quality information (as of v1.1).
- ML Kit text recognition: for Android, iOS and Web development, even offline
- Gmail OCR: extracting text from Gmail attachments
- Google Drive OCR: https://support.google.com/drive/answer/176692?hl=en
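To illustrate the choice between the two Vision API features listed above, here is a minimal sketch that builds the JSON body for the Vision API `images:annotate` REST method using only the standard library. The function name and the `dense_text` flag are hypothetical helpers introduced for this example; sending the request (authentication, HTTP client) is left out, and PDF/TIFF files go through a different file-oriented method that this sketch does not cover.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes, dense_text: bool) -> dict:
    """Build the JSON body for a Vision API `images:annotate` call.

    `dense_text` is a hypothetical flag for this sketch: the caller decides
    whether the image is a document, has dense text, or contains handwriting.
    """
    feature_type = "DOCUMENT_TEXT_DETECTION" if dense_text else "TEXT_DETECTION"
    return {
        "requests": [
            {
                # Raw image bytes are sent base64-encoded in the `content` field.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature_type}],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes; a real call would read an image file instead.
    body = build_annotate_request(b"fake image bytes", dense_text=True)
    print(json.dumps(body, indent=2))
```

The only difference between the two modes in this sketch is the feature `type` string, which is why switching a pipeline from sparse to dense text detection is usually a one-line change.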
Differences between Vision API OCR and Document AI OCR
The underlying models are the same in Vision API and Document AI, although Document AI can provide additional information about the document structure.
- Both DOCUMENT_TEXT_DETECTION and TEXT_DETECTION, as well as Document AI, support online (synchronous) requests and offline (asynchronous) processing of large batch files (PDF/TIFF).
- On pricing, the cost per 1000 documents is the same for Document AI OCR and for Vision API DOCUMENT_TEXT_DETECTION and TEXT_DETECTION. However, Vision API gives the first 1000 units/month for free, whereas Document AI doesn’t.
- Vision API can accept PDF/TIFF files up to 2000 pages per batch request, but Document AI OCR can only accept up to 500 pages per batch request.
- Document AI delivers multiple parser versions: v1.0, v1.1 and v1.2. v1.1 brings quality parameters without extra cost. v1.2 removes the quality parameters, but adds other capabilities.
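The page limits in the list above determine how many batch requests a long document needs. As a rough sketch, the hypothetical helper below splits a page count into 1-based page ranges that fit a given limit; the two constants simply restate the figures quoted in this article and may not match current quotas.

```python
from typing import List, Tuple

# Limits as quoted in this article (assumptions; check current quotas).
VISION_API_MAX_PAGES = 2000
DOCUMENT_AI_OCR_MAX_PAGES = 500


def split_into_batches(total_pages: int, max_pages: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per request."""
    if total_pages < 0 or max_pages <= 0:
        raise ValueError("total_pages must be >= 0 and max_pages must be > 0")
    ranges = []
    first = 1
    while first <= total_pages:
        last = min(first + max_pages - 1, total_pages)
        ranges.append((first, last))
        first = last + 1
    return ranges


if __name__ == "__main__":
    pages = 4500
    print(len(split_into_batches(pages, VISION_API_MAX_PAGES)))       # 3 requests
    print(len(split_into_batches(pages, DOCUMENT_AI_OCR_MAX_PAGES)))  # 9 requests
```

For a 4500-page file, the quoted limits translate into 3 Vision API batch requests versus 9 Document AI OCR batch requests.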
hOCR
The hOCR format is not supported today in Google OCR products. hOCR is an open standard for representing OCR output.
You or a partner can write a converter from the Document proto if necessary. The Document proto, which contains all the parser information, is described in the REST API documentation here. Converting to hOCR requires a specific conversion tool, like this public one available for Vision API OCR to hOCR, but adapted to Document AI output.
There is also a package to process hOCR.
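As a starting point for such a converter, here is a minimal sketch that turns a Document AI JSON response (loaded as a Python dict) into an hOCR fragment. It is not the public tool mentioned above. It assumes the REST field names `text`, `pages[].dimension`, `pages[].tokens[].layout.textAnchor.textSegments` and `layout.boundingPoly.normalizedVertices`, and it emits only a flat list of `ocrx_word` spans per `ocr_page`, without the XHTML header or the line and paragraph levels a complete hOCR file would normally carry. The function names are hypothetical.

```python
from html import escape


def _segment_text(full_text: str, layout: dict) -> str:
    """Resolve a layout's text segments against the full document text."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # Offsets are int64, so they arrive as strings in JSON; a zero
    # startIndex may be omitted entirely.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def _bbox(layout: dict, width: float, height: float) -> str:
    """Turn normalized vertices into an hOCR 'bbox x0 y0 x1 y1' in pixels."""
    vertices = layout.get("boundingPoly", {}).get("normalizedVertices", [])
    xs = [v.get("x", 0.0) * width for v in vertices] or [0.0]
    ys = [v.get("y", 0.0) * height for v in vertices] or [0.0]
    return f"bbox {round(min(xs))} {round(min(ys))} {round(max(xs))} {round(max(ys))}"


def document_to_hocr(document: dict) -> str:
    """Emit one ocr_page div per page with one ocrx_word span per token."""
    text = document.get("text", "")
    out = []
    for page_no, page in enumerate(document.get("pages", []), start=1):
        width = page.get("dimension", {}).get("width", 0)
        height = page.get("dimension", {}).get("height", 0)
        out.append(
            f"<div class='ocr_page' id='page_{page_no}' "
            f"title='bbox 0 0 {round(width)} {round(height)}'>"
        )
        for word_no, token in enumerate(page.get("tokens", []), start=1):
            layout = token.get("layout", {})
            # Token text often carries trailing whitespace; hOCR words do not.
            word = escape(_segment_text(text, layout).strip())
            out.append(
                f"  <span class='ocrx_word' id='word_{page_no}_{word_no}' "
                f"title='{_bbox(layout, width, height)}'>{word}</span>"
            )
        out.append("</div>")
    return "\n".join(out)
```

A fuller converter would also walk `paragraphs` and `lines` to produce `ocr_par` and `ocr_line` elements and could map token confidence to the hOCR `x_wconf` property.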