Open source tools for document preprocessing
https://github.com/GoogleCloudPlatform/document-ai-samples/tree/main/web-app-pix2info-python
https://cloud.google.com/document-ai/docs/handle-response#quickstart
Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a “wrapped” document object from a processed document response, which can come from JSON files in Cloud Storage, local JSON files, or output directly from the process_document() method.
It can perform the following actions:
- Combine sharded Document JSON files from Batch Processing into a single “wrapped” Document.
- Access text from Pages, Lines, Paragraphs, FormFields, and Tables without handling Layout information.
- Search for Pages containing a target string or matching a regular expression.
- Search for FormFields by name.
- Search for Entities by type.
- Convert Tables to a Pandas DataFrame or CSV.
- Insert Entities into a BigQuery table.
- Split a PDF file based on output from a Splitter/Classifier processor.
- Convert Documents to and from commonly used formats, such as the Cloud Vision API AnnotateFileResponse.
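As a rough illustration of the "access text without handling Layout information" point, here is a minimal sketch. It assumes the google-cloud-documentai-toolbox package is installed and that the wrapped document exposes pages, paragraphs, and form_fields as in the Toolbox samples. The helper names (load_wrapped_document, collect_paragraphs, collect_form_fields) are hypothetical and exist only for this example.

```python
def load_wrapped_document(json_path):
    """Wrap a local Document AI JSON response (assumes the Toolbox is installed)."""
    # Imported lazily so the helpers below can be used without the package.
    from google.cloud.documentai_toolbox import document

    return document.Document.from_document_path(document_path=json_path)


def collect_paragraphs(wrapped_document):
    """Return the text of every paragraph, in page order."""
    return [
        paragraph.text
        for page in wrapped_document.pages
        for paragraph in page.paragraphs
    ]


def collect_form_fields(wrapped_document):
    """Return a {field name: field value} mapping across all pages."""
    fields = {}
    for page in wrapped_document.pages:
        for form_field in page.form_fields:
            # A later page overwrites an earlier field with the same name.
            fields[form_field.field_name] = form_field.field_value
    return fields
```

A typical call would be collect_form_fields(load_wrapped_document("response.json")), where response.json is a placeholder for a saved processor response.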
Reading PDF files
PyPDF2 is a library for handling PDF files in Python. It can detect whether a file is encrypted or is actually a zip file.
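One possible way to run these checks before sending a file to a processor is sketched below. It assumes a PyPDF2 version that provides PdfReader and its is_encrypted property. The zip check here uses the standard-library zipfile module rather than PyPDF2, and the function names are hypothetical.

```python
import zipfile


def sniff_file_type(path):
    """Classify a file by its content rather than its extension."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if header == b"%PDF-":
        return "pdf"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"


def is_encrypted_pdf(path):
    """Return True if PyPDF2 reports the PDF as encrypted."""
    # Imported lazily so sniff_file_type works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return PdfReader(path).is_encrypted
```

The header check is deliberately strict and only accepts files that start with the PDF marker. Some readers tolerate leading bytes before the marker, so loosen the check if that matters for your inputs.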
Converting from image to PDF
img2pdf is a Python library for converting images into PDF files.
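A minimal sketch of that conversion follows, assuming img2pdf is installed. The list of suffixes and the helper names (list_images, images_to_pdf) are assumptions made for this example, not part of img2pdf.

```python
from pathlib import Path

# Assumed set of input formats for this sketch; adjust to your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def list_images(folder):
    """Return image paths in a folder, sorted so the page order is stable."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES
    )


def images_to_pdf(image_paths, output_path):
    """Write one PDF with one page per image."""
    if not image_paths:
        raise ValueError("no images to convert")
    # Imported lazily so list_images works without img2pdf installed.
    import img2pdf

    with open(output_path, "wb") as handle:
        handle.write(img2pdf.convert(image_paths))
```

For example, images_to_pdf(list_images("scans"), "scans.pdf") would bundle every image in a hypothetical scans folder into a single PDF.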
Detecting multiple files on the same page
If a single page contains several documents, for example multiple receipts, a splitter/classifier processor may not work, since it expects each document to be on different pages.
A solution would be to use a service like AutoML Object Detection (or any other object detection service) for intra-page splitting.
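Once an object detection service has returned one bounding box per receipt, the intra-page split comes down to cropping. The sketch below assumes boxes are given as normalized (x_min, y_min, x_max, y_max) tuples in the range 0 to 1 and that Pillow is installed for the cropping step. The box format, the margin parameter, and the function names are all assumptions.

```python
def to_pixel_box(box, width, height, margin=0):
    """Convert a normalized box to pixel coordinates, clamped to the page."""
    x_min, y_min, x_max, y_max = box
    left = max(0, round(x_min * width) - margin)
    top = max(0, round(y_min * height) - margin)
    right = min(width, round(x_max * width) + margin)
    bottom = min(height, round(y_max * height) + margin)
    return left, top, right, bottom


def crop_detections(image_path, boxes, output_prefix, margin=0):
    """Save one cropped image per detected box and return the output paths."""
    # Imported lazily so to_pixel_box works without Pillow installed.
    from PIL import Image

    outputs = []
    with Image.open(image_path) as image:
        for index, box in enumerate(boxes):
            pixel_box = to_pixel_box(box, image.width, image.height, margin)
            output_path = f"{output_prefix}_{index}.png"
            image.crop(pixel_box).save(output_path)
            outputs.append(output_path)
    return outputs
```

Each cropped image can then be sent to the processor as its own single-page document. A small margin helps avoid cutting off text at the edge of a tight box.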
Quality
You can use the Document AI Quality processor to detect blurry, skewed, or folded documents. It can also detect whether a document is too small.
Document AI itself handles documents that are rotated or of poor quality.
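To act on the quality processor's output, one option is to flag low-scoring pages for manual review. The sketch below works on the response parsed from JSON. It assumes each entity carries a page-level quality score in confidence, a page reference in pageAnchor, and per-defect scores in properties (with types such as quality/defect_blurry). That shape, the 0.5 threshold, and the function name are assumptions to verify against your own responses.

```python
def summarize_quality(document_json, min_score=0.5):
    """Return one summary dict per scored page of a quality processor response."""
    report = []
    for entity in document_json.get("entities", []):
        page_refs = entity.get("pageAnchor", {}).get("pageRefs") or [{}]
        # JSON omits the page index when it is 0 and encodes it as a string otherwise.
        page_number = int(page_refs[0].get("page", 0)) + 1
        score = entity.get("confidence", 0.0)
        defects = entity.get("properties", [])
        # Assumption: a higher defect confidence means the defect is more likely.
        top_defect = max(defects, key=lambda d: d.get("confidence", 0.0), default=None)
        report.append({
            "page": page_number,
            "score": score,
            "needs_review": score < min_score,
            "top_defect": top_defect["type"] if top_defect else None,
        })
    return report
```

Pages with needs_review set to True can be routed to a rescan or human review step instead of being sent straight to extraction.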