Open source tools for document preprocessing

Document AI

Python open source tools to preprocess documents before calling Document AI services

Author

Rafa Sanchez

Published

August 28, 2022

https://github.com/GoogleCloudPlatform/document-ai-samples/tree/main/web-app-pix2info-python

https://cloud.google.com/document-ai/docs/handle-response#quickstart

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a “wrapped” document object from a processed document response, which can come from JSON files in Cloud Storage, local JSON files, or output directly from the process_document() method.

It can perform the following actions:

- Combine sharded Document JSON files from batch processing into a single “wrapped” Document.
- Access text from Pages, Lines, Paragraphs, FormFields, and Tables without handling Layout information.
- Search for Pages containing a target string or matching a regular expression.
- Search for FormFields by name.
- Search for Entities by type.
- Convert Tables to a Pandas DataFrame or CSV.
- Insert Entities into a BigQuery table.
- Split a PDF file based on output from a Splitter/Classifier processor.
- Convert Documents to and from commonly used formats, such as the Cloud Vision API AnnotateFileResponse.
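As a rough sketch of the text-access actions listed above, the snippet below wraps a local Document JSON file and collects paragraphs and form fields per page. It assumes the `google-cloud-documentai-toolbox` package is installed. The helper names `load_wrapped_document` and `summarize_pages` are hypothetical, and attribute names may differ between toolbox versions.

```python
from typing import Any, Dict, List


def load_wrapped_document(json_path: str) -> Any:
    """Wrap a local Document JSON file in a toolbox Document object."""
    # Third-party dependency, imported here so summarize_pages() stays usable
    # without it: pip install google-cloud-documentai-toolbox
    from google.cloud.documentai_toolbox import document

    return document.Document.from_document_path(document_path=json_path)


def summarize_pages(wrapped_document: Any) -> List[Dict[str, Any]]:
    """Collect paragraph text and form fields per page, without Layout handling."""
    summary = []
    for page in wrapped_document.pages:
        summary.append(
            {
                # Each wrapped paragraph exposes its text directly.
                "paragraphs": [p.text for p in page.paragraphs],
                # Form fields are exposed as name/value pairs.
                "form_fields": {
                    f.field_name: f.field_value for f in page.form_fields
                },
                "table_count": len(page.tables),
            }
        )
    return summary
```

A typical call would be `summarize_pages(load_wrapped_document("output/invoice-0.json"))`, where the path is a placeholder for a processed Document saved as JSON.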

Reading PDF files

PyPDF2 is a library for handling PDF files in Python. It can detect whether a file is encrypted or is not actually a PDF (for example, a zip file).
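A pre-flight check along these lines might look like the sketch below. It assumes PyPDF2 2.x or later (for `PdfReader` and `PyPDF2.errors`). The function name `inspect_upload` and its return labels are hypothetical, and the zip test uses the standard-library `zipfile` module rather than PyPDF2 itself.

```python
import zipfile


def inspect_upload(path: str) -> str:
    """Return 'zip', 'not_pdf', 'encrypted', or 'ok' for a candidate input file."""
    # Standard-library check: is the file really a zip archive?
    if zipfile.is_zipfile(path):
        return "zip"

    # A PDF carries a "%PDF-" marker near the start of the file.
    with open(path, "rb") as fh:
        head = fh.read(1024)
    if b"%PDF-" not in head:
        return "not_pdf"

    # Third-party dependency, imported only when needed: pip install PyPDF2
    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError

    try:
        reader = PdfReader(path)
    except PdfReadError:
        # PyPDF2 could not parse the file as a PDF.
        return "not_pdf"
    return "encrypted" if reader.is_encrypted else "ok"
```

Files that come back as `ok` can be sent on to Document AI. The others can be rejected or routed to a separate handling step.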

Converting from image to PDF

img2pdf is a Python library that converts images into PDF files.
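A minimal sketch of that conversion, assuming the `img2pdf` package is installed. The helper name `images_to_pdf` and the file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Sequence


def images_to_pdf(image_paths: Sequence[str], output_path: str) -> Path:
    """Write one PDF with one page per input image, in the given order."""
    if not image_paths:
        raise ValueError("image_paths must contain at least one image")
    missing = [p for p in image_paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"missing input images: {missing}")

    import img2pdf  # third-party dependency: pip install img2pdf

    out = Path(output_path)
    # img2pdf.convert() returns the finished PDF as bytes.
    out.write_bytes(img2pdf.convert(list(image_paths)))
    return out
```

For example, `images_to_pdf(["scan-1.jpg", "scan-2.jpg"], "scans.pdf")` would produce a two-page PDF that can then be sent to a Document AI processor.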

Detecting multiple files on the same page

If a single page contains several documents, for example multiple receipts, a splitter/classifier processor may not work, since it can only split documents that sit on different pages.

One solution is to use a service such as AutoML Object Detection (or any other object detection service) for intra-page splitting.
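Once a detector has located each receipt, the remaining step is to turn its bounding boxes into crop regions on the page image. The sketch below assumes the detector returns boxes normalized to the range 0 to 1, which should be checked against the service actually used. The names `NormalizedBox` and `to_pixel_box` are hypothetical.

```python
from typing import NamedTuple, Tuple


class NormalizedBox(NamedTuple):
    """A bounding box with coordinates in [0, 1] (assumed detector output)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float


def to_pixel_box(
    box: NormalizedBox, width: int, height: int, padding: int = 0
) -> Tuple[int, int, int, int]:
    """Convert a normalized box to (left, top, right, bottom) pixel coordinates."""
    # Scale to pixels, add optional padding, and clamp to the image bounds.
    left = max(0, round(box.x_min * width) - padding)
    top = max(0, round(box.y_min * height) - padding)
    right = min(width, round(box.x_max * width) + padding)
    bottom = min(height, round(box.y_max * height) + padding)
    return (left, top, right, bottom)
```

The resulting tuple has the layout that Pillow's `Image.crop()` accepts, so each detected receipt can be cropped into its own image and processed as a separate document.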

Quality

You can use the Document AI Quality processor to detect blurry, skewed, or folded documents. It can also detect when a document is too small.
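As a hedged sketch of how the quality output might be used, the function below flags pages whose overall score falls under a threshold. It assumes `document` is the `Document` returned by the `google-cloud-documentai` client, with per-page `image_quality_scores` populated. The name `low_quality_pages` and the 0.5 threshold are illustrative assumptions, not recommended values.

```python
from typing import Any, Dict, List


def low_quality_pages(document: Any, min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Return page number, score, and defects for pages scoring below min_score."""
    flagged = []
    for page in document.pages:
        scores = page.image_quality_scores
        if scores.quality_score < min_score:
            flagged.append(
                {
                    "page_number": page.page_number,
                    "quality_score": scores.quality_score,
                    # Map each defect type to the confidence reported for it.
                    "defects": {
                        d.type_: d.confidence for d in scores.detected_defects
                    },
                }
            )
    return flagged
```

Pages flagged this way can be sent back to the user for rescanning before any extraction processor is called.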

If documents are rotated or of poor quality, Document AI handles that on its own.