OCR and Document Intelligence
This page documents the server-side OCR helpers in SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr. The key entry points are SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/OcrUtils.java, SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/DocumentAiUtils.java, and SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/VisionApiUtils.java.
Purpose
The OCR layer extracts text from PDFs and images, optionally adds OCR text layers back into generated PDFs, and normalizes page and image handling so other services can consume the results.
Scope
This page focuses on OCR and document intelligence helpers. It does not cover the broader web-service API or desktop conversion utilities unless they feed OCR directly.
Entry Points
- SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/OcrUtils.java
- SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/OcrSettings.java
- SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/DocumentAiUtils.java
- SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/VisionApiUtils.java
- SC/suredms-web-service/src/main/java/com/sureclinical/suredms/ocr/tesseract/TesseractUtils.java
Primary Components
- OcrUtils is the top-level facade. It chooses the engine, reads OCR text, adds OCR layers, and preprocesses pages for OCR runs.
- OcrSettings holds processing options such as DPI, page size, orientation, auto-rotate, debug mode, and image colorspace.
- DocumentAiUtils wraps Google Document AI integration, including caching of OCR results and PDF generation with OCR overlays.
- VisionApiUtils wraps Google Vision OCR and converts OCR responses into the PDF overlay path.
- TesseractUtils provides a local OCR path for image-based text extraction.
Data Flow
- Callers pass a PDF or image file into OcrUtils with an engine name.
- OcrUtils routes the request to Vision, Tesseract, or Document AI.
- The selected engine extracts OCR data or text.
- For overlay generation, pages may be preprocessed, split into images, and rendered back into a searchable PDF.
- DocumentAiUtils caches OCR output so repeated runs can reuse prior results.
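The routing step in the flow above can be sketched as follows. This is a minimal illustration, not the actual OcrUtils code: the class name OcrRoutingSketch, the Engine enum, the accepted engine-name spellings, and the file-extension check are all assumptions made for the example. The only behaviors taken from this page are the three engine choices and the image-only restriction on the Tesseract path.

```java
import java.util.Locale;

/** Hypothetical sketch of engine routing; not the real OcrUtils API. */
final class OcrRoutingSketch {

    /** The three engines named on this page. */
    enum Engine { VISION, TESSERACT, DOCUMENT_AI }

    /** Parses a caller-supplied engine name. The accepted spellings are assumed. */
    static Engine parseEngine(String name) {
        if (name == null) {
            throw new IllegalArgumentException("engine name is required");
        }
        // Normalize case and separators so "Document AI" and "document-ai" match.
        String key = name.trim().toLowerCase(Locale.ROOT)
                .replace("-", "").replace("_", "").replace(" ", "");
        switch (key) {
            case "vision":
                return Engine.VISION;
            case "tesseract":
                return Engine.TESSERACT;
            case "documentai":
                return Engine.DOCUMENT_AI;
            default:
                throw new IllegalArgumentException("unknown OCR engine: " + name);
        }
    }

    /** Treats any file name ending in .pdf as a PDF (an assumed, simplistic check). */
    static boolean isPdf(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Picks the engine for a request and rejects PDF input on the Tesseract path. */
    static Engine route(String engineName, String fileName) {
        Engine engine = parseEngine(engineName);
        if (engine == Engine.TESSERACT && isPdf(fileName)) {
            throw new UnsupportedOperationException(
                    "Tesseract path accepts image inputs only");
        }
        return engine;
    }
}
```

Under these assumptions, a call such as OcrRoutingSketch.route("Document AI", "report.pdf") selects the Document AI engine, while requesting Tesseract for a PDF fails fast instead of producing empty output.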
Key Behaviors
- The Vision engine's results are serialized as JSON-like output for each page.
- Tesseract is limited to image inputs for now.
- Document AI is the more complete path for PDFs and multi-page document processing.
- Page size and DPI settings control image scaling before OCR overlay generation.
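The relationship between page size, DPI, and image scaling can be illustrated with a little arithmetic. The sketch below assumes page sizes are expressed in PDF points (1/72 inch) and that images are scaled uniformly to fit the target page; the class PageScalingSketch and its method names are hypothetical and do not come from OcrSettings.

```java
/** Hypothetical sketch of page-size and DPI arithmetic; not the real OcrSettings API. */
final class PageScalingSketch {

    /** A PDF point is 1/72 of an inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a page dimension in points to pixels at the given DPI. */
    static int pointsToPixels(double points, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    /**
     * Returns the uniform scale factor that fits a source image inside the
     * target pixel box while preserving its aspect ratio.
     */
    static double fitScale(int srcWidth, int srcHeight, int targetWidth, int targetHeight) {
        if (srcWidth <= 0 || srcHeight <= 0) {
            throw new IllegalArgumentException("source dimensions must be positive");
        }
        return Math.min((double) targetWidth / srcWidth,
                        (double) targetHeight / srcHeight);
    }
}
```

For example, a US Letter page is 612 x 792 points, so at 300 DPI the target image is 2550 x 3300 pixels. A 5100 x 6600 pixel scan would then be scaled by 0.5 to fit.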
Dependencies and Integrations
- Google Cloud Vision and Document AI provide OCR services.
- iText PDF OCR classes create the final searchable PDF outputs.
- Shared image and PDF utilities handle page conversion, resizing, and color-space adjustments.
Edge Cases and Constraints
- OcrUtils still contains TODOs for multi-file support and broader PDF handling in the Tesseract path.
- DocumentAiUtils can use either an in-memory runtime cache or a static cache folder for development and testing.
- VisionApiUtils currently contains a very basic local GoogleOcrEngine adapter.
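The two cache modes mentioned for DocumentAiUtils could be modeled as interchangeable implementations of one small interface. The sketch below is an assumption-heavy illustration: the OcrCache interface, both implementations, the SHA-256 content key, and the .json file naming are invented for this example and may not match how DocumentAiUtils actually stores results.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the two cache modes; not the real DocumentAiUtils code. */
final class OcrCacheSketch {

    /** Minimal cache contract: look up and store OCR output by key. */
    interface OcrCache {
        Optional<String> get(String key);
        void put(String key, String ocrOutput);
    }

    /** Runtime cache that lives only as long as the process. */
    static final class MemoryCache implements OcrCache {
        private final Map<String, String> entries = new ConcurrentHashMap<>();
        public Optional<String> get(String key) {
            return Optional.ofNullable(entries.get(key));
        }
        public void put(String key, String ocrOutput) {
            entries.put(key, ocrOutput);
        }
    }

    /** Folder-backed cache, useful when results should survive restarts in dev or test. */
    static final class FolderCache implements OcrCache {
        private final Path folder;
        FolderCache(Path folder) {
            this.folder = folder;
        }
        public Optional<String> get(String key) {
            Path file = folder.resolve(key + ".json");
            if (!Files.isRegularFile(file)) {
                return Optional.empty();
            }
            try {
                return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public void put(String key, String ocrOutput) {
            try {
                Files.createDirectories(folder);
                Files.write(folder.resolve(key + ".json"),
                        ocrOutput.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Derives a cache key from file content (assumed strategy: hex-encoded SHA-256). */
    static String contentKey(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable", e);
        }
    }
}
```

Keying on content rather than file name means a renamed copy of the same document would still hit the cache, at the cost of hashing the file on every lookup.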