OCR Document Analysis

This document describes the OCR (Optical Character Recognition) integration for extracting text from uploaded documents using PaddleOCR.

Table of Contents¶

Overview
Multi-Tenancy Context
Architecture
Configuration
API Endpoints
OCR Configuration
Document Analysis (Company Admin)
Document Search
Document Admin
Company File Uploads
Database Schema
Docker Setup
Usage Examples

Overview¶

The OCR integration enables automatic text extraction from uploaded documents (PDFs, images, Word documents). Key features:

Per-company OCR configuration - Enable/disable OCR per company
Automatic processing - OCR runs automatically on file upload when enabled
Full-text search - Search within extracted document content
Multi-tenancy support - Client-isolated databases with company-isolated file storage

Multi-Tenancy Context¶

OCR configuration and processing operate within TGM's multi-tenant architecture:

┌─────────────────────────────────────────────────────────────────┐
│                       Platform Level                             │
│  OCR Service (PaddleOCR Docker) - Shared across all clients     │
│  Global OCR Enable/Disable (app.ocr.enabled)                    │
└─────────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Client A    │      │   Client B    │      │   Client C    │
│   Database    │      │   Database    │      │   Database    │
├───────────────┤      ├───────────────┤      ├───────────────┤
│   Company     │      │   Company     │      │   Company     │
│ ocrEnabled:   │      │ ocrEnabled:   │      │ ocrEnabled:   │
│   true        │      │   false       │      │   true        │
│ ocrLanguage:  │      │               │      │ ocrLanguage:  │
│   'en'        │      │               │      │   'fr'        │
├───────────────┤      └───────────────┘      ├───────────────┤
│ document_     │                             │ document_     │
│ contents      │                             │ contents      │
│ (table)       │                             │ (table)       │
└───────────────┘                             └───────────────┘

Key Points¶

OCR Service is shared across all clients (single Docker container)
OCR Configuration is per-company within each client's database
Document Contents are stored in each client's isolated database
Files are stored in company-specific folders in storage (MinIO/S3)

API Request Headers¶

All OCR API requests require multi-tenant headers:

curl -X GET "http://localhost:1337/api/companies/1/ocr-config" \
  -H "Authorization: Bearer $JWT" \
  -H "X-Client-ID: acme-corp"

Header	Required	Description
`Authorization`	Yes	JWT or API token
`X-Client-ID`	Yes*	Client identifier
`X-Tenant-ID`	No	Sandbox (omit for production)

*Can be resolved via subdomain: acme-corp.tgm-expert.com
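
As a minimal client-side sketch of the header rules in the table above, the helper below assembles the headers for a request. The function name `build_headers` and the sample token and sandbox values are illustrative assumptions, not part of the TGM API.

```python
from typing import Dict, Optional


def build_headers(token: str, client_id: Optional[str] = None,
                  tenant_id: Optional[str] = None) -> Dict[str, str]:
    """Assemble the multi-tenant headers described in the table above."""
    if not token:
        raise ValueError("Authorization token is required")
    headers = {"Authorization": f"Bearer {token}"}
    # X-Client-ID may be left out when the client is resolved from the subdomain.
    if client_id:
        headers["X-Client-ID"] = client_id
    # X-Tenant-ID selects a sandbox; omitting it targets production.
    if tenant_id:
        headers["X-Tenant-ID"] = tenant_id
    return headers


# Hypothetical values, for illustration only.
production = build_headers("example-jwt", client_id="acme-corp")
sandbox = build_headers("example-jwt", client_id="acme-corp", tenant_id="sandbox-1")
```

Keeping the optional headers out of the dictionary entirely, rather than sending empty values, mirrors the "omit for production" rule.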

Supported File Types¶

Type	Extensions
Images	PNG, JPG, JPEG, TIFF, BMP, WEBP
Documents	PDF, DOCX, DOC

Supported Languages¶

Code	Language
`ch`	Chinese (Simplified)
`en`	English
`fr`	French
`german`	German
`korean`	Korean
`japan`	Japanese

Architecture¶

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│  Spring Boot    │────▶│  PaddleOCR      │
│   (Frontend)    │     │  (Java)         │     │  (Docker)       │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
              ┌──────────┐            ┌───────────────┐
              │PostgreSQL│            │     MinIO     │
              │(content) │            │ (file storage)│
              └──────────┘            └───────────────┘

Flow:

1. User uploads file via API
2. File stored in MinIO under company-specific folder
3. If OCR enabled for company, async processing triggered
4. PaddleOCR extracts text from document
5. Extracted text stored in document_contents table
6. Text indexed for full-text search

Configuration¶

Application Properties¶

app:
  ocr:
    # Enable/disable OCR globally
    enabled: ${OCR_ENABLED:false}

    # OCR service URL
    service-url: ${OCR_SERVICE_URL:http://localhost:8000}

    # OCR API endpoint path
    endpoint: ${OCR_ENDPOINT:/ocr}

    # Request timeout in seconds
    timeout-seconds: ${OCR_TIMEOUT_SECONDS:120}

Environment Variables¶

Variable	Default	Description
`OCR_ENABLED`	`false`	Enable OCR globally
`OCR_SERVICE_URL`	`http://localhost:8000`	PaddleOCR service URL
`OCR_ENDPOINT`	`/ocr`	OCR API endpoint path
`OCR_TIMEOUT_SECONDS`	`120`	Request timeout
`OCR_IMAGE`	`blazordevlab/paddleocrapi:latest`	Docker image to use
`OCR_LANGUAGE`	`ch`	Default OCR language
`OCR_USE_GPU`	`false`	Enable GPU acceleration
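
The `${VAR:default}` placeholders in the application properties fall back to the defaults listed above when a variable is unset. The sketch below reproduces that fallback for a script or test harness that needs the same values. The `OcrSettings` class and `load_ocr_settings` function are hypothetical names, and the boolean parsing only approximates how Spring binds properties.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    enabled: bool
    service_url: str
    endpoint: str
    timeout_seconds: int


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Read the OCR variables, using the documented defaults when unset."""
    return OcrSettings(
        enabled=env.get("OCR_ENABLED", "false").strip().lower() == "true",
        service_url=env.get("OCR_SERVICE_URL", "http://localhost:8000"),
        endpoint=env.get("OCR_ENDPOINT", "/ocr"),
        timeout_seconds=int(env.get("OCR_TIMEOUT_SECONDS", "120")),
    )


# An empty mapping yields the defaults; explicit values override them.
defaults = load_ocr_settings({})
overridden = load_ocr_settings({"OCR_ENABLED": "true", "OCR_TIMEOUT_SECONDS": "300"})
```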

API Endpoints¶

OCR Configuration¶

Manage OCR settings per company.

Get OCR Configuration¶

GET /api/companies/{companyId}/ocr-config

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg,jpeg,tiff,docx,doc",
    "supportedFileTypes": ["pdf", "png", "jpg", "jpeg", "tiff", "docx", "doc"]
  }
}

Update OCR Configuration¶

PUT /api/companies/{companyId}/ocr-config
Authorization: Bearer {token}
Content-Type: application/json

Request Body:

{
  "ocrEnabled": true,
  "ocrLanguage": "en",
  "ocrAutoProcess": true,
  "ocrFileTypes": "pdf,png,jpg"
}

Field	Type	Description
`ocrEnabled`	boolean	Enable/disable OCR for company
`ocrLanguage`	string	Default language code (2-5 chars)
`ocrAutoProcess`	boolean	Auto-process uploads with OCR
`ocrFileTypes`	string	Comma-separated file extensions

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg",
    "supportedFileTypes": ["pdf", "png", "jpg"]
  },
  "message": "OCR configuration updated successfully"
}
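
In the sample responses, `supportedFileTypes` is the list form of the comma-separated `ocrFileTypes` string. A client that wants to check a file before uploading could derive the same list as sketched below. Both function names are illustrative, and the normalization (trimming, lowercasing, dropping duplicates) is an assumption about sensible client behavior rather than a description of the server.

```python
from typing import List


def parse_file_types(ocr_file_types: str) -> List[str]:
    """Split a comma-separated extension string into a clean, ordered list."""
    seen: List[str] = []
    for raw in ocr_file_types.split(","):
        ext = raw.strip().lower().lstrip(".")
        if ext and ext not in seen:
            seen.append(ext)
    return seen


def is_supported(filename: str, ocr_file_types: str) -> bool:
    """Return True when the file's extension is in the configured list."""
    _, dot, ext = filename.rpartition(".")
    # A name without a dot has no extension and is never supported.
    return bool(dot) and ext.lower() in parse_file_types(ocr_file_types)
```

For example, `parse_file_types("pdf,png,jpg")` gives the list shown in the response above, and `is_supported("video.mp4", "pdf,png,jpg")` is false.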

Enable OCR¶

POST /api/companies/{companyId}/ocr-config/enable
Authorization: Bearer {token}

Required Role: ADMIN or SUPER_ADMIN

Disable OCR¶

POST /api/companies/{companyId}/ocr-config/disable
Authorization: Bearer {token}

Required Role: ADMIN or SUPER_ADMIN

Get OCR Service Status¶

Check if the OCR service is available.

GET /api/companies/ocr-status

Response:

{
  "data": {
    "serviceAvailable": true,
    "serviceUrl": "http://ocr-service:8000",
    "serviceVersion": "PaddleOCR",
    "supportedLanguages": [
      {"code": "ch", "name": "Chinese"},
      {"code": "en", "name": "English"},
      {"code": "fr", "name": "French"}
    ],
    "defaultLanguage": "ch"
  }
}

Document Analysis (Company Admin)¶

View and manage OCR document analyses for your organization. These endpoints are designed for company administrators to monitor OCR processing status through a sidebar menu in Organization Settings.

Base URL: /api/companies/{companyId}/document-analysis

Required Role: Authenticated user (company member)

Get Analysis Summary¶

Get dashboard statistics for document analysis in your organization.

GET /api/companies/{companyId}/document-analysis/summary
Authorization: Bearer {token}

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "totalDocuments": 150,
    "completedCount": 140,
    "failedCount": 5,
    "pendingCount": 3,
    "processingCount": 2,
    "skippedCount": 0,
    "totalCharactersExtracted": 2456789,
    "serviceAvailable": true,
    "successRate": 96.55
  }
}
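
The sample `successRate` of 96.55 is consistent with completed / (completed + failed) × 100, since 140 / 145 ≈ 0.9655. The sketch below assumes that formula; the server's actual calculation is not shown in this document, and returning 0.0 when nothing has finished is an arbitrary choice made here.

```python
def success_rate(completed: int, failed: int) -> float:
    """Percentage of finished documents that completed, rounded to 2 places."""
    finished = completed + failed
    if finished == 0:
        # Assumption: report 0.0 rather than raising when nothing has finished.
        return 0.0
    return round(completed / finished * 100, 2)


sample_rate = success_rate(completed=140, failed=5)
```

Pending, processing, and skipped documents are left out of the denominator under this assumption.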

List Document Analyses¶

Get paginated list of all document analyses with optional status filter.

GET /api/companies/{companyId}/document-analysis?status={status}
Authorization: Bearer {token}

Parameters:

Parameter	Type	Required	Description
`status`	string	No	Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED
`page`	int	No	Page number (default: 0)
`size`	int	No	Page size (default: 20)
`sort`	string	No	Sort field (e.g., `createdAt,desc`)

Response:

{
  "content": [
    {
      "id": 1,
      "fileId": 123,
      "fileName": "invoice.pdf",
      "fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
      "mimeType": "application/pdf",
      "status": "COMPLETED",
      "statusLabel": "Completed",
      "errorMessage": null,
      "textLength": 5432,
      "wordCount": 876,
      "pageCount": 3,
      "ocrConfidence": 0.9523,
      "ocrLanguage": "en",
      "processingDurationMs": 15234,
      "createdAt": "2025-02-05T10:30:00",
      "processedAt": "2025-02-05T10:30:15",
      "textPreview": "INVOICE #12345..."
    }
  ],
  "totalElements": 150,
  "totalPages": 8,
  "number": 0,
  "size": 20
}
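
The paging fields fit together as `totalPages = ceil(totalElements / size)`: 150 elements at 20 per page give 8 pages. The sketch below walks every page using only the fields shown in the response. `fetch_page` stands in for the real HTTP call and `fake_fetch` is a made-up data source so the example runs offline; neither is part of the API.

```python
import math
from typing import Callable, Dict, Iterator


def total_pages(total_elements: int, size: int) -> int:
    if size <= 0:
        raise ValueError("size must be positive")
    return math.ceil(total_elements / size)


def iter_items(fetch_page: Callable[[int, int], Dict], size: int = 20) -> Iterator[Dict]:
    """Yield every item by requesting pages 0, 1, ... until totalPages is reached."""
    page = 0
    while True:
        body = fetch_page(page, size)
        yield from body["content"]
        page += 1
        if page >= body["totalPages"]:
            break


def fake_fetch(page: int, size: int) -> Dict:
    """Stand-in for the HTTP request; serves 45 fake analyses."""
    ids = list(range(1, 46))
    chunk = ids[page * size:(page + 1) * size]
    return {"content": [{"id": i} for i in chunk],
            "totalElements": len(ids),
            "totalPages": total_pages(len(ids), size),
            "number": page, "size": size}


all_ids = [item["id"] for item in iter_items(fake_fetch)]
```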

Get Analysis Detail¶

Get full details of a specific document analysis.

GET /api/companies/{companyId}/document-analysis/{analysisId}?includeFullText=false
Authorization: Bearer {token}

Parameters:

Parameter	Type	Default	Description
`includeFullText`	boolean	false	Include full extracted text in response

Response:

{
  "data": {
    "id": 1,
    "fileId": 123,
    "fileName": "invoice.pdf",
    "fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
    "mimeType": "application/pdf",
    "fileSizeBytes": 262144,
    "status": "COMPLETED",
    "statusLabel": "Completed",
    "errorMessage": null,
    "extractedText": "Full text here...",
    "textPreview": "INVOICE #12345...",
    "textLength": 5432,
    "wordCount": 876,
    "pageCount": 3,
    "ocrConfidence": 0.9523,
    "ocrLanguage": "en",
    "processingDurationMs": 15234,
    "processingStartedAt": "2025-02-05T10:30:00",
    "createdAt": "2025-02-05T10:29:55",
    "processedAt": "2025-02-05T10:30:15",
    "indexedForSearch": true,
    "confidencePercent": 95,
    "processingDurationSeconds": 15.234,
    "fileSizeKb": 256.0,
    "retryable": false,
    "complete": true,
    "inProgress": false
  }
}
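
Three fields in the detail response look like conversions of other fields: `confidencePercent` from `ocrConfidence`, `processingDurationSeconds` from `processingDurationMs`, and `fileSizeKb` from `fileSizeBytes`. The sketch below reproduces the sample values under that reading. The function name is hypothetical, and the rounding mode the server uses for the percentage is an assumption.

```python
def derived_fields(ocr_confidence: float, processing_duration_ms: int,
                   file_size_bytes: int) -> dict:
    """Recompute the convenience fields from their raw counterparts."""
    return {
        # 0.9523 -> 95 (rounded to the nearest whole percent).
        "confidencePercent": round(ocr_confidence * 100),
        # 15234 ms -> 15.234 s.
        "processingDurationSeconds": processing_duration_ms / 1000,
        # 262144 bytes -> 256.0 KB, using 1024 bytes per KB.
        "fileSizeKb": file_size_bytes / 1024,
    }


sample_derived = derived_fields(0.9523, 15234, 262144)
```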

Retry Failed Analysis¶

Retry OCR processing for a failed document.

POST /api/companies/{companyId}/document-analysis/{analysisId}/retry
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 10 requests per user per 60 seconds

Response:

{
  "data": {
    "message": "Analysis retry started",
    "analysisId": 1,
    "status": "processing"
  },
  "message": "OCR processing has been queued"
}

Error Response (OCR disabled):

{
  "error": "OCR is disabled for this organization. Enable it in settings first."
}

Retry All Failed Analyses¶

Queue all failed documents for reprocessing.

POST /api/companies/{companyId}/document-analysis/retry-all-failed
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 3 requests per company per 5 minutes

Response:

{
  "data": {
    "message": "Retry started for all failed analyses",
    "count": 5,
    "status": "processing"
  },
  "message": "5 documents queued for reprocessing"
}

Get Queue Status¶

Get current OCR queue status for monitoring.

GET /api/companies/{companyId}/document-analysis/queue-status
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Response:

{
  "data": {
    "pending": 5,
    "processing": 2,
    "completed": 150,
    "failed": 3,
    "skipped": 1,
    "total": 161
  }
}
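
In the sample, `total` equals the sum of the five per-status counts (5 + 2 + 150 + 3 + 1 = 161). A monitoring script could use that as a sanity check and derive a backlog figure, as in the sketch below. The `backlog` definition (pending plus processing) is an assumption made here, not a field the endpoint returns.

```python
from typing import Mapping

QUEUE_STATES = ("pending", "processing", "completed", "failed", "skipped")


def queue_total(counts: Mapping[str, int]) -> int:
    """Sum the per-status counts; should match the reported total."""
    return sum(counts.get(state, 0) for state in QUEUE_STATES)


def backlog(counts: Mapping[str, int]) -> int:
    """Documents not yet finished (assumed: pending + processing)."""
    return counts.get("pending", 0) + counts.get("processing", 0)


sample_queue = {"pending": 5, "processing": 2, "completed": 150,
                "failed": 3, "skipped": 1, "total": 161}
```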

Manually Analyze a File¶

Trigger OCR analysis for a specific file.

POST /api/companies/{companyId}/document-analysis/analyze/{fileId}
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 20 requests per user per 60 seconds

Response (success):

{
  "data": {
    "message": "OCR analysis started",
    "fileId": 123,
    "fileName": "contract.pdf",
    "status": "processing"
  },
  "message": "File queued for OCR processing"
}

Error Responses:

- OCR disabled: "OCR is disabled for this organization. Enable it in settings first."
- Service unavailable: "OCR service is not available. Please try again later."
- Unsupported type: "File type not supported for OCR: video.mp4"
- Already processed: "File has already been analyzed successfully"

Document Search¶

Search within OCR-extracted document content.

Search Documents¶

GET /search/documents?companyId={companyId}&q={query}

Parameters:

Parameter	Type	Required	Description
`companyId`	long	Yes	Company ID
`q`	string	Yes	Search query
`page`	int	No	Page number (default: 0)
`size`	int	No	Page size (default: 20)

Response:

{
  "content": [
    {
      "id": 1,
      "fileId": 123,
      "fileType": "strapi",
      "textLength": 5432,
      "wordCount": 876,
      "ocrConfidence": 0.9523,
      "ocrLanguage": "en",
      "pageCount": 3,
      "processingStatus": "COMPLETED",
      "indexedForSearch": true,
      "companyId": 1,
      "createdAt": "2025-02-05T10:30:00",
      "processedAt": "2025-02-05T10:30:15",
      "processingDurationMs": 15234,
      "textPreview": "This is the first 500 characters of the extracted text..."
    }
  ],
  "totalElements": 25,
  "totalPages": 2,
  "number": 0,
  "size": 20
}
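
Search queries often contain spaces or reserved characters, so the `q` parameter should be URL-encoded before it is sent. The sketch below builds the request URL with the standard library. `build_search_url` is an illustrative helper, and rejecting an empty query on the client side is an assumption based on `q` being required.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, company_id: int, query: str,
                     page: int = 0, size: int = 20) -> str:
    """Build the document search URL with properly encoded parameters."""
    if not query.strip():
        raise ValueError("q must not be empty")
    params = {"companyId": company_id, "q": query, "page": page, "size": size}
    return f"{base_url.rstrip('/')}/search/documents?{urlencode(params)}"


simple_url = build_search_url("http://localhost:1337", 1, "invoice")
encoded_url = build_search_url("http://localhost:1337/", 1, "net 30 & due", page=1)
```

`urlencode` turns spaces into `+` and `&` into `%26`, so a query such as `net 30 & due` cannot be mistaken for extra parameters.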

Get Document Content¶

GET /search/documents/{fileId}?includeFullText=false

Parameters:

Parameter	Type	Default	Description
`includeFullText`	boolean	false	Include full extracted text

Document Admin¶

Administrative endpoints for managing OCR processing.

Base URL: /api/admin/documents

Required Role: ADMIN or SUPER_ADMIN

Get OCR Service Status¶

GET /api/admin/documents/ocr-status

Get OCR Statistics¶

GET /api/admin/documents/stats?companyId={companyId}

Response:

{
  "companyId": 1,
  "totalDocuments": 150,
  "statusCounts": {
    "COMPLETED": 140,
    "FAILED": 5,
    "PENDING": 3,
    "PROCESSING": 2
  },
  "totalCharactersExtracted": 2456789,
  "indexedDocuments": 140
}

List Documents¶

GET /api/admin/documents?companyId={companyId}&status={status}

Parameters:

Parameter	Type	Required	Description
`companyId`	long	Yes	Company ID
`status`	string	No	Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED
`page`	int	No	Page number
`size`	int	No	Page size

Get Document Content¶

GET /api/admin/documents/{fileId}?includeFullText=true

Manually Process File¶

Trigger OCR processing for a specific file.

POST /api/admin/documents/{fileId}/process?companyId={companyId}

Response:

{
  "message": "OCR processing started for file 123",
  "status": "processing"
}

Reprocess Failed File¶

POST /api/admin/documents/{fileId}/reprocess

Reprocess All Failed Documents¶

POST /api/admin/documents/reprocess-failed?companyId={companyId}

Response:

{
  "message": "Reprocessing started for failed documents in company 1",
  "status": "processing"
}

Delete Document Content¶

DELETE /api/admin/documents/{fileId}

Company File Uploads (Multi-Tenancy)¶

Upload files to company-specific folders in MinIO for tenant isolation.

Storage Pattern: companies/{companyId}/{subFolder}/{hash}{ext}

Upload File for Company¶

POST /api/upload/company/{companyId}?folder={subFolder}
Content-Type: multipart/form-data

Example:

curl -X POST "http://localhost:1337/api/upload/company/1?folder=documents" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@invoice.pdf"

Response:

[
  {
    "id": 123,
    "name": "invoice.pdf",
    "hash": "invoice_abc123def4",
    "ext": ".pdf",
    "mime": "application/pdf",
    "size": 256.5,
    "url": "http://minio:9000/tgm-uploads/companies/1/documents/invoice_abc123def4.pdf",
    "provider": "minio",
    "folderPath": "/companies/1/documents"
  }
]
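
The storage pattern `companies/{companyId}/{subFolder}/{hash}{ext}` can be checked against the sample response: company 1, folder `documents`, hash `invoice_abc123def4`, and extension `.pdf` produce the path at the end of the `url` field. The sketch below builds that key. The function is hypothetical, it only covers the case where a sub-folder is given, and the `..` check is an added precaution rather than documented server behavior.

```python
def storage_key(company_id: int, sub_folder: str, file_hash: str, ext: str) -> str:
    """Build the object key for a company upload from the documented pattern."""
    folder = sub_folder.strip("/")
    # Refuse empty folders and parent-directory segments that could leave the company prefix.
    if not folder or ".." in folder.split("/"):
        raise ValueError("sub_folder must be a non-empty relative path")
    suffix = ext if ext.startswith(".") else f".{ext}"
    return f"companies/{company_id}/{folder}/{file_hash}{suffix}"


sample_key = storage_key(1, "documents", "invoice_abc123def4", ".pdf")
```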

Upload Multiple Files for Company¶

POST /api/upload/company/{companyId}/multiple?folder={subFolder}
Content-Type: multipart/form-data

Upload and Link File for Company¶

POST /api/upload/company/{companyId}/link
Content-Type: multipart/form-data

Parameters:

Parameter	Type	Required	Description
`companyId`	long	Yes	Company ID (path)
`files`	file	Yes	File to upload
`folder`	string	No	Sub-folder
`ref`	string	Yes	Entity type (e.g., `api::article.article`)
`refId`	long	Yes	Entity ID
`field`	string	Yes	Field name (e.g., `attachments`)

Database Schema¶

Company Table Additions¶

ALTER TABLE companies
    ADD COLUMN ocr_enabled BOOLEAN DEFAULT FALSE,
    ADD COLUMN ocr_language VARCHAR(10) DEFAULT 'en',
    ADD COLUMN ocr_auto_process BOOLEAN DEFAULT TRUE,
    ADD COLUMN ocr_file_types TEXT DEFAULT 'pdf,png,jpg,jpeg,tiff,docx,doc';

Document Contents Table¶

CREATE TABLE document_contents (
    id BIGSERIAL PRIMARY KEY,
    file_id BIGINT NOT NULL,
    file_type VARCHAR(20) DEFAULT 'strapi',
    extracted_text TEXT,
    text_length INTEGER,
    ocr_confidence DECIMAL(5,4),
    ocr_language VARCHAR(10),
    page_count INTEGER DEFAULT 1,
    word_count INTEGER,
    processing_status VARCHAR(20) DEFAULT 'pending',
    error_message TEXT,
    processing_started_at TIMESTAMP,
    processing_duration_ms INTEGER,
    indexed_for_search BOOLEAN DEFAULT FALSE,
    search_embedding_id BIGINT,
    company_id BIGINT NOT NULL REFERENCES companies(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP
);

-- Full-text search index
CREATE INDEX idx_document_contents_text_search
    ON document_contents USING gin(to_tsvector('english', COALESCE(extracted_text, '')));

Processing Status Values¶

Status	Description
`PENDING`	Queued for processing
`PROCESSING`	Currently being processed
`COMPLETED`	Successfully processed
`FAILED`	Processing failed (see error_message)
`SKIPPED`	Skipped (unsupported type or OCR disabled)
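
The detail response carries three boolean helpers (`complete`, `inProgress`, `retryable`) that follow from the status. The sketch below shows one plausible mapping. Treating PENDING and PROCESSING as in progress, and FAILED and SKIPPED as retryable, are assumptions drawn from the sample response and the retry button described for failed/skipped rows; the server's rules may differ. Input is upper-cased because the schema default is written in lowercase.

```python
from enum import Enum


class ProcessingStatus(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"


# Assumed groupings; not confirmed by the API documentation.
IN_PROGRESS = {ProcessingStatus.PENDING, ProcessingStatus.PROCESSING}
RETRYABLE = {ProcessingStatus.FAILED, ProcessingStatus.SKIPPED}


def status_flags(status: str) -> dict:
    """Map a status string to the helper flags; unknown values raise ValueError."""
    parsed = ProcessingStatus(status.upper())
    return {
        "complete": parsed is ProcessingStatus.COMPLETED,
        "inProgress": parsed in IN_PROGRESS,
        "retryable": parsed in RETRYABLE,
    }
```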

Docker Setup¶

Starting OCR Service¶

# Start all services including OCR
docker-compose --profile ocr up -d

# Or start only OCR service
docker-compose --profile ocr up -d ocr-service

Using Alternative Docker Images¶

# Use bvdcode/paddleocrapi
OCR_IMAGE=bvdcode/paddleocrapi:latest docker-compose --profile ocr up -d

# Use m986883511/paddleocr:api (GPU support)
OCR_IMAGE=m986883511/paddleocr:api docker-compose --profile ocr up -d

Docker Compose Configuration¶

ocr-service:
  image: ${OCR_IMAGE:-blazordevlab/paddleocrapi:latest}
  container_name: tgm-ocr-service
  environment:
    - LANG=${OCR_LANGUAGE:-ch}
    - USE_GPU=${OCR_USE_GPU:-false}
  ports:
    - "8000:8000"
  networks:
    - tgm-network
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/"]
    interval: 30s
    timeout: 10s
    start_period: 120s
    retries: 3
  profiles:
    - ocr

GPU Support¶

For GPU acceleration, uncomment the deploy section in docker-compose.yml:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

Usage Examples¶

Enable OCR for a Company¶

# Enable OCR
curl -X POST "http://localhost:1337/api/companies/1/ocr-config/enable" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Configure OCR settings
curl -X PUT "http://localhost:1337/api/companies/1/ocr-config" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg,docx"
  }'

Upload Document with OCR¶

# Upload to company folder (OCR auto-triggered if enabled)
curl -X POST "http://localhost:1337/api/upload/company/1?folder=invoices" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@invoice.pdf"

Search Document Content¶

# Search for documents containing "invoice"
curl "http://localhost:1337/search/documents?companyId=1&q=invoice" \
  -H "Authorization: Bearer $TOKEN"

Manually Trigger OCR¶

# Process a specific file
curl -X POST "http://localhost:1337/api/admin/documents/123/process?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Reprocess all failed documents
curl -X POST "http://localhost:1337/api/admin/documents/reprocess-failed?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Check OCR Status¶

# Check service health
curl "http://localhost:1337/api/companies/ocr-status" \
  -H "Authorization: Bearer $TOKEN"

# Get processing statistics
curl "http://localhost:1337/api/admin/documents/stats?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

View Document Analysis (Company Admin)¶

# Get analysis summary for dashboard
curl "http://localhost:1337/api/companies/1/document-analysis/summary" \
  -H "Authorization: Bearer $TOKEN"

# List all analyses
curl "http://localhost:1337/api/companies/1/document-analysis" \
  -H "Authorization: Bearer $TOKEN"

# List only failed analyses
curl "http://localhost:1337/api/companies/1/document-analysis?status=FAILED" \
  -H "Authorization: Bearer $TOKEN"

# Get detail for specific analysis
curl "http://localhost:1337/api/companies/1/document-analysis/123?includeFullText=true" \
  -H "Authorization: Bearer $TOKEN"

# Retry a failed analysis
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/123/retry" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Retry all failed analyses
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/retry-all-failed" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Manually trigger analysis for a file
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/analyze/456" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Frontend Integration¶

The Document Analysis feature is designed to be displayed as a sidebar menu item in Organization Settings. Here's the recommended UI structure:

Sidebar Menu Entry:

Organization Settings
├── General
├── Members
├── Document Analysis  <-- New entry
└── ...

Dashboard View (/org/:companyId/document-analysis):

Display the summary statistics from GET /document-analysis/summary:

- Total documents processed
- Success rate percentage
- Status breakdown (completed, failed, pending, processing)
- OCR enabled/disabled status
- Service availability indicator

List View (/org/:companyId/document-analysis/list):

Display paginated table from GET /document-analysis:

Column	Description
File Name	Link to file preview
Status	Badge with color (green=completed, red=failed, yellow=pending)
Words	Word count extracted
Confidence	OCR confidence percentage
Processed	When processing completed
Actions	Retry button (for failed/skipped)

Filters:

- Status dropdown (All, Completed, Failed, Pending, Processing, Skipped)
- Date range picker

Detail View (/org/:companyId/document-analysis/:id):

Show full analysis details including:

- File preview/download link
- Processing timeline
- Extracted text (with copy button)
- OCR confidence score
- Error message (if failed)
- Retry button

Status Badge Colors¶

Status	Color	Icon
COMPLETED	Green	✓
FAILED	Red	✗
PENDING	Yellow	⏳
PROCESSING	Blue	⟳
SKIPPED	Gray	−

Troubleshooting¶

OCR Service Not Available¶

Check if OCR is enabled globally: OCR_ENABLED=true
Verify OCR service is running: docker-compose --profile ocr ps
Check service health: curl http://localhost:8000/

Processing Failures¶

Check document status via admin API
Review error messages in document_contents.error_message
Check OCR service logs: docker-compose logs ocr-service

Slow Processing¶

Consider enabling GPU: OCR_USE_GPU=true
Increase timeout: OCR_TIMEOUT_SECONDS=300
Check file size limits in OCR service configuration

Testing¶

The Document Analysis feature includes comprehensive test coverage:

Unit Tests¶

DTO Tests (DocumentAnalysisDtoTest.java) - 32 tests

- DocumentAnalysisSummaryDto: Builder pattern, getters/setters, success rate calculation
- DocumentAnalysisItemDto: Confidence percent calculation, processing duration, retryable status
- DocumentAnalysisDetailDto: File size conversion, status helpers (isComplete, isInProgress, isRetryable)
- Edge cases: Zero values, null handling, large numbers

Controller Unit Tests (DocumentAnalysisControllerTest.java) - 22 tests

- GET /summary - Returns correct summary with mocked service
- GET / - Pagination, status filtering, file info mapping
- GET /{analysisId} - Detail retrieval with/without full text
- POST /{analysisId}/retry - Authorization, status validation
- POST /retry-all-failed - Batch retry with count
- POST /analyze/{fileId} - File validation, OCR checks

Integration Tests¶

Controller Integration Tests (DocumentAnalysisControllerIntegrationTest.java) - 28 tests

Uses Testcontainers for real database testing:

Test Category	Tests	Description
GetSummary	3	Summary stats, auth, 404 handling
ListAnalyses	5	Pagination, COMPLETED/FAILED/PENDING filters, file info
GetAnalysisDetail	4	Detail with/without text, company isolation
RetryAnalysis	4	Admin retry, regular user 403, status validation
RetryAllFailed	3	Batch retry, zero count, authorization
AnalyzeFile	4	OCR disabled, service unavailable, authorization
DataIntegrity	2	Company isolation, database consistency
EdgeCases	3	Empty company, pagination edge cases

Running Tests¶

# Run all tests
mvn test

# Run only Document Analysis tests
mvn test -Dtest="DocumentAnalysis*"

# Run only integration tests
mvn test -Dtest="DocumentAnalysisControllerIntegrationTest"

# Run only DTO tests
mvn test -Dtest="DocumentAnalysisDtoTest"

Test Data Setup¶

Integration tests create:

- Test company with OCR enabled
- Regular user (Authenticated role)
- Admin user (Admin role)
- Sample files (PDF) in different states:
    - COMPLETED: With extracted text, confidence, word count
    - FAILED: With error message
    - PENDING: Awaiting processing

Production Readiness¶

Implemented Features¶

Feature	Status	Notes
Authorization	✅	Role-based: Authenticated (view), ADMIN/MANAGER (actions)
Multi-tenancy	✅	Company isolation via `company_id` foreign key
Error Handling	✅	Meaningful messages for all failure scenarios
Async Processing	✅	`@Async("ocrExecutor")` for non-blocking OCR
Pagination	✅	Spring Data pagination with status filtering + max page size (100)
Rate Limiting	✅	Redis-based rate limiting on retry/analyze endpoints
Metrics & Monitoring	✅	Micrometer metrics for Prometheus/Actuator
API Documentation	✅	OpenAPI annotations on all endpoints
Logging	✅	Structured logging with SLF4J
Test Coverage	✅	117+ tests (DTO + unit + integration)

Security Considerations¶

Aspect	Status	Details
Authentication	✅	JWT required for all endpoints
Authorization	✅	`@PreAuthorize` on all endpoints
Company Isolation	✅	Documents filtered by `companyId` parameter
Input Validation	✅	Status enum validated, max page size enforced
Rate Limiting	✅	Configurable per-user and per-company limits

Rate Limiting Configuration¶

Rate limits are configurable via application.yml:

app:
  ocr:
    rate-limit:
      retry:
        max-requests: 10        # Max retries per user per minute
        window-seconds: 60
      analyze:
        max-requests: 20        # Max analyze requests per user per minute
        window-seconds: 60
      retry-all:
        max-requests: 3         # Max batch retries per company per 5 minutes
        window-seconds: 300
    pagination:
      max-page-size: 100        # Max items per page
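
To illustrate what a `max-requests` / `window-seconds` pair means in practice, here is a small in-memory fixed-window limiter. It is only a stand-in: the real limiter is Redis-based, and whether it uses a fixed or sliding window is not stated here. The class name, the key format, and the injected clock are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most max_requests per key within each window_seconds window."""

    def __init__(self, max_requests: int, window_seconds: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._windows: Dict[str, Tuple[int, int]] = {}  # key -> (window index, count)

    def allow(self, key: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        start, count = self._windows.get(key, (window, 0))
        if start != window:
            count = 0  # A new window has begun, so the counter resets.
        if count >= self.max_requests:
            self._windows[key] = (window, count)
            return False
        self._windows[key] = (window, count + 1)
        return True


# Batch-retry limit from the configuration: 3 requests per company per 300 seconds.
now = [0.0]
limiter = FixedWindowLimiter(max_requests=3, window_seconds=300, clock=lambda: now[0])
first_four = [limiter.allow("company:1") for _ in range(4)]
now[0] = 300.0  # Move the fake clock into the next window.
after_reset = limiter.allow("company:1")
```

Injecting the clock keeps the example deterministic; a shared store such as Redis is still needed once more than one application instance serves requests.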

Monitoring & Metrics¶

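OCR processing exposes Prometheus metrics via Spring Boot Actuator. Micrometer's dotted metric names appear in Prometheus with underscores (for example, ocr.queue.pending becomes ocr_queue_pending), which is the form used in the alerting rules.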

Counters:

Metric	Description
`ocr.documents.processed.total`	Total documents submitted for processing
`ocr.documents.success.total`	Successful OCR extractions
`ocr.documents.failed.total`	Failed OCR extractions
`ocr.retry.total`	Manual retry requests
`ocr.retry.all.total`	Batch retry requests

Gauges:

Metric	Description
`ocr.queue.pending`	Documents awaiting processing
`ocr.queue.processing`	Documents currently being processed
`ocr.queue.failed`	Documents with failed status

Timer:

Metric	Description
`ocr.processing.duration`	Processing time (p50, p75, p95, p99)

Queue Status Endpoint:

GET /api/companies/{companyId}/document-analysis/queue-status

Returns:

{
  "data": {
    "pending": 5,
    "processing": 2,
    "completed": 150,
    "failed": 3,
    "skipped": 1,
    "total": 161
  }
}

Health Endpoint:

GET /actuator/health

Returns comprehensive health status including OCR service (when enabled):

{
  "status": "UP",
  "components": {
    "application": {
      "status": "UP",
      "details": {
        "name": "TGM Manager Server",
        "version": "1.1.0",
        "features": {
          "ocr": true,
          "influxDb": false,
          "semanticSearch": false,
          "email": true,
          "sms": false,
          "sso": true,
          "license": true,
          "rabbitmq": true,
          "ragflow": false,
          "cron": true
        }
      }
    },
    "db": { "status": "UP" },
    "minio": { "status": "UP", "details": { "bucket": "tgm-uploads" } },
    "rabbit": { "status": "UP", "details": { "version": "3.12.14" } },
    "redis": { "status": "UP", "details": { "version": "8.4.0" } },
    "ocr": { "status": "UP", "details": { "serviceUrl": "http://localhost:8000" } },
    "influxdb": { "status": "UP", "details": { "org": "ensolutions" } }
  }
}

Performance¶

Aspect	Implementation
Async OCR	Dedicated `ocrExecutor` thread pool
Database Indexes	Full-text search index on `extracted_text`
Lazy Loading	File info fetched separately per document
Pagination	Server-side with enforced max size (100)
Rate Limiting	Redis-based via CacheService

Alerting Recommendations¶

Use the metrics to set up alerts:

# Prometheus alerting rules
groups:
  - name: ocr-alerts
    rules:
      - alert: OcrQueueBacklogHigh
        expr: ocr_queue_pending > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog is high ({{ $value }} pending)"

      - alert: OcrHighFailureRate
        expr: rate(ocr_documents_failed_total[5m]) / rate(ocr_documents_processed_total[5m]) > 0.3
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "OCR failure rate exceeds 30%"
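
The failure-rate rule divides the rate of failed documents by the rate of processed documents and fires when the ratio stays above 0.3. The sketch below reproduces that comparison for a single sample so the threshold can be reasoned about outside Prometheus. It is an illustration only: the function names are made up, and the `for: 10m` hold period is not modeled.

```python
from typing import Optional


def failure_ratio(failed_per_sec: float, processed_per_sec: float) -> Optional[float]:
    """Return failed/processed, or None when nothing was processed."""
    if processed_per_sec <= 0:
        # In PromQL a 0/0 division yields NaN, which never satisfies "> 0.3".
        return None
    return failed_per_sec / processed_per_sec


def should_alert(failed_per_sec: float, processed_per_sec: float,
                 threshold: float = 0.3) -> bool:
    ratio = failure_ratio(failed_per_sec, processed_per_sec)
    return ratio is not None and ratio > threshold
```

Because the comparison is strict, a ratio of exactly 0.3 does not trigger the alert, and an idle queue never does.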
