This document describes the OCR (Optical Character Recognition) integration for extracting text from uploaded documents using PaddleOCR.

Overview

The OCR integration enables automatic text extraction from uploaded documents (PDFs, images, Word documents). Key features:

  • Per-company OCR configuration - Enable/disable OCR per company
  • Automatic processing - OCR runs automatically on file upload when enabled
  • Full-text search - Search within extracted document content
  • Multi-tenancy support - Client-isolated databases with company-isolated file storage

Multi-Tenancy Context

OCR configuration and processing operate within TGM's multi-tenant architecture:

┌─────────────────────────────────────────────────────────────────┐
│                       Platform Level                             │
│  OCR Service (PaddleOCR Docker) - Shared across all clients     │
│  Global OCR Enable/Disable (app.ocr.enabled)                    │
└─────────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Client A    │      │   Client B    │      │   Client C    │
│   Database    │      │   Database    │      │   Database    │
├───────────────┤      ├───────────────┤      ├───────────────┤
│   Company     │      │   Company     │      │   Company     │
│ ocrEnabled:   │      │ ocrEnabled:   │      │ ocrEnabled:   │
│   true        │      │   false       │      │   true        │
│ ocrLanguage:  │      │               │      │ ocrLanguage:  │
│   'en'        │      │               │      │   'fr'        │
├───────────────┤      └───────────────┘      ├───────────────┤
│ document_     │                             │ document_     │
│ contents      │                             │ contents      │
│ (table)       │                             │ (table)       │
└───────────────┘                             └───────────────┘

Key Points

  1. OCR Service is shared across all clients (single Docker container)
  2. OCR Configuration is per-company within each client's database
  3. Document Contents are stored in each client's isolated database
  4. Files are stored in company-specific folders in storage (MinIO/S3)

API Request Headers

All OCR API requests require multi-tenant headers:

curl -X GET "http://localhost:1337/api/companies/1/ocr-config" \
  -H "Authorization: Bearer $JWT" \
  -H "X-Client-ID: acme-corp"

| Header | Required | Description |
|--------|----------|-------------|
| Authorization | Yes | JWT or API token |
| X-Client-ID | Yes* | Client identifier |
| X-Tenant-ID | No | Sandbox (omit for production) |

*Can be resolved via subdomain: acme-corp.tgm-expert.com
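
The header rules above can be sketched as a small client-side helper. This is a minimal illustration, not part of the TGM codebase: the function name `build_ocr_headers` and its parameters are hypothetical, and it only encodes what the table states.

```python
from typing import Optional


def build_ocr_headers(token: str,
                      client_id: Optional[str] = None,
                      tenant_id: Optional[str] = None) -> dict:
    """Build the multi-tenant request headers (hypothetical helper)."""
    if not token:
        raise ValueError("Authorization token is required")
    headers = {"Authorization": f"Bearer {token}"}
    # X-Client-ID may be left out only when the client is resolved via subdomain.
    if client_id:
        headers["X-Client-ID"] = client_id
    # X-Tenant-ID selects a sandbox; omit it for production.
    if tenant_id:
        headers["X-Tenant-ID"] = tenant_id
    return headers


print(build_ocr_headers("my-jwt", client_id="acme-corp"))
```

The resulting dictionary can be passed to any HTTP client when calling the endpoints in this document.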

Supported File Types

| Type | Extensions |
|------|------------|
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP |
| Documents | PDF, DOCX, DOC |

Supported Languages

| Code | Language |
|------|----------|
| ch | Chinese (Simplified) |
| en | English |
| fr | French |
| german | German |
| korean | Korean |
| japan | Japanese |

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│  Spring Boot    │────▶│  PaddleOCR      │
│   (Frontend)    │     │  (Java)         │     │  (Docker)       │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
              ┌──────────┐            ┌───────────────┐
              │PostgreSQL│            │     MinIO     │
              │(content) │            │ (file storage)│
              └──────────┘            └───────────────┘

Flow:

  1. User uploads file via API
  2. File stored in MinIO under company-specific folder
  3. If OCR enabled for company, async processing triggered
  4. PaddleOCR extracts text from document
  5. Extracted text stored in document_contents table
  6. Text indexed for full-text search
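
Step 3 of the flow depends on the global switch, the company settings, and the file type. The sketch below shows one way that decision could be expressed. The class and function names are hypothetical, and the exact combination of checks used by the server is an assumption based on the configuration fields described in this document.

```python
from dataclasses import dataclass


@dataclass
class CompanyOcrConfig:
    """Per-company settings (mirrors ocrEnabled, ocrAutoProcess, ocrFileTypes)."""
    ocr_enabled: bool
    ocr_auto_process: bool
    ocr_file_types: str  # comma-separated, e.g. "pdf,png,jpg"


def should_run_ocr(global_enabled: bool, config: CompanyOcrConfig, file_name: str) -> bool:
    """Return True when an uploaded file should be queued for OCR (assumed rule)."""
    # All three switches must be on: platform, company, and auto-processing.
    if not (global_enabled and config.ocr_enabled and config.ocr_auto_process):
        return False
    allowed = {ext.strip().lower() for ext in config.ocr_file_types.split(",") if ext.strip()}
    # Compare the lower-cased extension against the company's allowed list.
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return extension in allowed


config = CompanyOcrConfig(ocr_enabled=True, ocr_auto_process=True, ocr_file_types="pdf,png,jpg")
print(should_run_ocr(True, config, "invoice.pdf"))
```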


Configuration

Application Properties

app:
  ocr:
    # Enable/disable OCR globally
    enabled: ${OCR_ENABLED:false}

    # OCR service URL
    service-url: ${OCR_SERVICE_URL:http://localhost:8000}

    # OCR API endpoint path
    endpoint: ${OCR_ENDPOINT:/ocr}

    # Request timeout in seconds
    timeout-seconds: ${OCR_TIMEOUT_SECONDS:120}

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OCR_ENABLED | false | Enable OCR globally |
| OCR_SERVICE_URL | http://localhost:8000 | PaddleOCR service URL |
| OCR_ENDPOINT | /ocr | OCR API endpoint path |
| OCR_TIMEOUT_SECONDS | 120 | Request timeout in seconds |
| OCR_IMAGE | blazordevlab/paddleocrapi:latest | Docker image to use |
| OCR_LANGUAGE | ch | Default OCR language |
| OCR_USE_GPU | false | Enable GPU acceleration |
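
To illustrate how the application-level variables and their defaults fit together, here is a minimal sketch that reads them from an environment mapping. The `OcrSettings` class and `load_ocr_settings` function are hypothetical; the real application resolves these values through Spring property placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    enabled: bool
    service_url: str
    endpoint: str
    timeout_seconds: int

    @property
    def request_url(self) -> str:
        # Join base URL and endpoint path without doubling the slash.
        return self.service_url.rstrip("/") + "/" + self.endpoint.lstrip("/")


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Read OCR settings, falling back to the documented defaults."""
    return OcrSettings(
        enabled=env.get("OCR_ENABLED", "false").strip().lower() == "true",
        service_url=env.get("OCR_SERVICE_URL", "http://localhost:8000"),
        endpoint=env.get("OCR_ENDPOINT", "/ocr"),
        timeout_seconds=int(env.get("OCR_TIMEOUT_SECONDS", "120")),
    )


settings = load_ocr_settings({"OCR_ENABLED": "true"})
print(settings.request_url, settings.timeout_seconds)
```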

API Endpoints

OCR Configuration

Manage OCR settings per company.

Get OCR Configuration

GET /api/companies/{companyId}/ocr-config

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg,jpeg,tiff,docx,doc",
    "supportedFileTypes": ["pdf", "png", "jpg", "jpeg", "tiff", "docx", "doc"]
  }
}

Update OCR Configuration

PUT /api/companies/{companyId}/ocr-config
Authorization: Bearer {token}
Content-Type: application/json

Request Body:

{
  "ocrEnabled": true,
  "ocrLanguage": "en",
  "ocrAutoProcess": true,
  "ocrFileTypes": "pdf,png,jpg"
}

| Field | Type | Description |
|-------|------|-------------|
| ocrEnabled | boolean | Enable/disable OCR for company |
| ocrLanguage | string | Default language code (2-5 chars) |
| ocrAutoProcess | boolean | Auto-process uploads with OCR |
| ocrFileTypes | string | Comma-separated file extensions |

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg",
    "supportedFileTypes": ["pdf", "png", "jpg"]
  },
  "message": "OCR configuration updated successfully"
}
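
In the responses above, `supportedFileTypes` looks like the comma-separated `ocrFileTypes` string split into a list. A minimal sketch of that conversion follows; the function name is hypothetical, and the trimming, lower-casing, and de-duplication steps are assumptions rather than confirmed server behavior.

```python
def parse_file_types(ocr_file_types: str) -> list:
    """Split a comma-separated extension string into a clean, ordered list."""
    result = []
    for part in ocr_file_types.split(","):
        # Normalize " .PDF " to "pdf" (assumed normalization).
        ext = part.strip().lower().lstrip(".")
        if ext and ext not in result:
            result.append(ext)
    return result


print(parse_file_types("pdf,png,jpg"))
```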

Enable OCR

POST /api/companies/{companyId}/ocr-config/enable
Authorization: Bearer {token}

Required Role: ADMIN or SUPER_ADMIN

Disable OCR

POST /api/companies/{companyId}/ocr-config/disable
Authorization: Bearer {token}

Required Role: ADMIN or SUPER_ADMIN

Get OCR Service Status

Check if the OCR service is available.

GET /api/companies/ocr-status

Response:

{
  "data": {
    "serviceAvailable": true,
    "serviceUrl": "http://ocr-service:8000",
    "serviceVersion": "PaddleOCR",
    "supportedLanguages": [
      {"code": "ch", "name": "Chinese"},
      {"code": "en", "name": "English"},
      {"code": "fr", "name": "French"}
    ],
    "defaultLanguage": "ch"
  }
}


Document Analysis (Company Admin)

View and manage OCR document analyses for your organization. These endpoints are designed for company administrators to monitor OCR processing status through a sidebar menu in Organization Settings.

Base URL: /api/companies/{companyId}/document-analysis

Required Role: Authenticated user (company member)

Get Analysis Summary

Get dashboard statistics for document analysis in your organization.

GET /api/companies/{companyId}/document-analysis/summary
Authorization: Bearer {token}

Response:

{
  "data": {
    "companyId": 1,
    "companyName": "Acme Corp",
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "totalDocuments": 150,
    "completedCount": 140,
    "failedCount": 5,
    "pendingCount": 3,
    "processingCount": 2,
    "skippedCount": 0,
    "totalCharactersExtracted": 2456789,
    "serviceAvailable": true,
    "successRate": 96.55
  }
}
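
The `successRate` in the sample response is consistent with dividing completed documents by the sum of completed and failed ones. The sketch below reproduces that arithmetic. The formula is inferred from the sample numbers, not confirmed by the source, and the function name is hypothetical.

```python
def success_rate(completed: int, failed: int) -> float:
    """Percentage of finished documents that completed successfully (assumed formula)."""
    finished = completed + failed
    if finished == 0:
        # Nothing has finished yet; report 0.0 instead of dividing by zero.
        return 0.0
    return round(completed * 100 / finished, 2)


print(success_rate(140, 5))
```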

List Document Analyses

Get a paginated list of all document analyses, with an optional status filter.

GET /api/companies/{companyId}/document-analysis?status={status}
Authorization: Bearer {token}

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| status | string | No | Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED |
| page | int | No | Page number (default: 0) |
| size | int | No | Page size (default: 20) |
| sort | string | No | Sort field (e.g., createdAt,desc) |

Response:

{
  "content": [
    {
      "id": 1,
      "fileId": 123,
      "fileName": "invoice.pdf",
      "fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
      "mimeType": "application/pdf",
      "status": "COMPLETED",
      "statusLabel": "Completed",
      "errorMessage": null,
      "textLength": 5432,
      "wordCount": 876,
      "pageCount": 3,
      "ocrConfidence": 0.9523,
      "ocrLanguage": "en",
      "processingDurationMs": 15234,
      "createdAt": "2025-02-05T10:30:00",
      "processedAt": "2025-02-05T10:30:15",
      "textPreview": "INVOICE #12345..."
    }
  ],
  "totalElements": 150,
  "totalPages": 8,
  "number": 0,
  "size": 20
}
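
The query parameters of the list endpoint can be assembled as in the following sketch. The helper name `analysis_list_url` is hypothetical, and the base URL is taken from the examples in this document.

```python
from typing import Optional
from urllib.parse import urlencode

VALID_STATUSES = {"PENDING", "PROCESSING", "COMPLETED", "FAILED", "SKIPPED"}


def analysis_list_url(base_url: str, company_id: int, status: Optional[str] = None,
                      page: int = 0, size: int = 20, sort: Optional[str] = None) -> str:
    """Build the URL for the paginated document-analysis list."""
    params = {"page": page, "size": size}
    if status is not None:
        status = status.upper()
        if status not in VALID_STATUSES:
            raise ValueError(f"Unknown status: {status}")
        params["status"] = status
    if sort:
        params["sort"] = sort  # e.g. "createdAt,desc"; the comma is percent-encoded
    path = f"/api/companies/{company_id}/document-analysis"
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


print(analysis_list_url("http://localhost:1337", 1, status="failed"))
```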

Get Analysis Detail

Get full details of a specific document analysis.

GET /api/companies/{companyId}/document-analysis/{analysisId}?includeFullText=false
Authorization: Bearer {token}

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| includeFullText | boolean | false | Include full extracted text in response |

Response:

{
  "data": {
    "id": 1,
    "fileId": 123,
    "fileName": "invoice.pdf",
    "fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
    "mimeType": "application/pdf",
    "fileSizeBytes": 262144,
    "status": "COMPLETED",
    "statusLabel": "Completed",
    "errorMessage": null,
    "extractedText": "Full text here...",
    "textPreview": "INVOICE #12345...",
    "textLength": 5432,
    "wordCount": 876,
    "pageCount": 3,
    "ocrConfidence": 0.9523,
    "ocrLanguage": "en",
    "processingDurationMs": 15234,
    "processingStartedAt": "2025-02-05T10:30:00",
    "createdAt": "2025-02-05T10:29:55",
    "processedAt": "2025-02-05T10:30:15",
    "indexedForSearch": true,
    "confidencePercent": 95,
    "processingDurationSeconds": 15.234,
    "fileSizeKb": 256.0,
    "retryable": false,
    "complete": true,
    "inProgress": false
  }
}
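
Several fields in the detail response appear to be derived from others: `confidencePercent` from `ocrConfidence`, `processingDurationSeconds` from `processingDurationMs`, and `fileSizeKb` from `fileSizeBytes`. The sketch below reproduces the sample values. The rounding rule and the 1024-byte divisor are assumptions inferred from the example, and the function name is hypothetical.

```python
def derive_detail_fields(ocr_confidence: float, processing_duration_ms: int,
                         file_size_bytes: int) -> dict:
    """Compute the convenience fields shown in the detail response (assumed rules)."""
    return {
        # 0.9523 -> 95 (rounded to the nearest whole percent)
        "confidencePercent": round(ocr_confidence * 100),
        # 15234 ms -> 15.234 s
        "processingDurationSeconds": processing_duration_ms / 1000,
        # 262144 bytes -> 256.0 KB, using 1024 bytes per KB
        "fileSizeKb": file_size_bytes / 1024,
    }


print(derive_detail_fields(0.9523, 15234, 262144))
```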

Retry Failed Analysis

Retry OCR processing for a failed document.

POST /api/companies/{companyId}/document-analysis/{analysisId}/retry
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 10 requests per user per 60 seconds

Response:

{
  "data": {
    "message": "Analysis retry started",
    "analysisId": 1,
    "status": "processing"
  },
  "message": "OCR processing has been queued"
}

Error Response (OCR disabled):

{
  "error": "OCR is disabled for this organization. Enable it in settings first."
}

Retry All Failed Analyses

Queue all failed documents for reprocessing.

POST /api/companies/{companyId}/document-analysis/retry-all-failed
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 3 requests per company per 5 minutes

Response:

{
  "data": {
    "message": "Retry started for all failed analyses",
    "count": 5,
    "status": "processing"
  },
  "message": "5 documents queued for reprocessing"
}

Get Queue Status

Get current OCR queue status for monitoring.

GET /api/companies/{companyId}/document-analysis/queue-status
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Response:

{
  "data": {
    "pending": 5,
    "processing": 2,
    "completed": 150,
    "failed": 3,
    "skipped": 1,
    "total": 161
  }
}
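
In the sample response, `total` equals the sum of the five status counters. A monitoring client could use that as a sanity check, as in this sketch. The helper name is hypothetical, and treating `total` as the sum of the five counters is inferred from the sample only.

```python
STATUS_KEYS = ("pending", "processing", "completed", "failed", "skipped")


def queue_is_consistent(queue_status: dict) -> bool:
    """Check that the per-status counters add up to the reported total."""
    return sum(queue_status[key] for key in STATUS_KEYS) == queue_status["total"]


sample = {"pending": 5, "processing": 2, "completed": 150,
          "failed": 3, "skipped": 1, "total": 161}
print(queue_is_consistent(sample))
```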

Manually Analyze a File

Trigger OCR analysis for a specific file.

POST /api/companies/{companyId}/document-analysis/analyze/{fileId}
Authorization: Bearer {token}

Required Role: ADMIN or MANAGER

Rate Limit: 20 requests per user per 60 seconds

Response (success):

{
  "data": {
    "message": "OCR analysis started",
    "fileId": 123,
    "fileName": "contract.pdf",
    "status": "processing"
  },
  "message": "File queued for OCR processing"
}

Error Responses:

  • OCR disabled: "OCR is disabled for this organization. Enable it in settings first."
  • Service unavailable: "OCR service is not available. Please try again later."
  • Unsupported type: "File type not supported for OCR: video.mp4"
  • Already processed: "File has already been analyzed successfully"
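
The error cases listed above suggest a sequence of checks before a file is queued. The sketch below models them as a single function. The order of the checks, the function name, and its parameters are assumptions; only the messages come from this document.

```python
from typing import Optional, Set


def analyze_precheck_error(ocr_enabled: bool, service_available: bool, file_name: str,
                           allowed_types: Set[str], already_completed: bool) -> Optional[str]:
    """Return the error message for a manual analyze request, or None if it may proceed."""
    if not ocr_enabled:
        return "OCR is disabled for this organization. Enable it in settings first."
    if not service_available:
        return "OCR service is not available. Please try again later."
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in allowed_types:
        return f"File type not supported for OCR: {file_name}"
    if already_completed:
        return "File has already been analyzed successfully"
    return None


print(analyze_precheck_error(True, True, "video.mp4", {"pdf", "png"}, False))
```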


Document Search

Search within OCR-extracted document content.

Search Documents

GET /search/documents?companyId={companyId}&q={query}

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID |
| q | string | Yes | Search query |
| page | int | No | Page number (default: 0) |
| size | int | No | Page size (default: 20) |

Response:

{
  "content": [
    {
      "id": 1,
      "fileId": 123,
      "fileType": "strapi",
      "textLength": 5432,
      "wordCount": 876,
      "ocrConfidence": 0.9523,
      "ocrLanguage": "en",
      "pageCount": 3,
      "processingStatus": "COMPLETED",
      "indexedForSearch": true,
      "companyId": 1,
      "createdAt": "2025-02-05T10:30:00",
      "processedAt": "2025-02-05T10:30:15",
      "processingDurationMs": 15234,
      "textPreview": "This is the first 500 characters of the extracted text..."
    }
  ],
  "totalElements": 25,
  "totalPages": 2,
  "number": 0,
  "size": 20
}

Get Document Content

GET /search/documents/{fileId}?includeFullText=false

Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| includeFullText | boolean | false | Include full extracted text |


Document Admin

Administrative endpoints for managing OCR processing.

Base URL: /api/admin/documents

Required Role: ADMIN or SUPER_ADMIN

Get OCR Service Status

GET /api/admin/documents/ocr-status

Get OCR Statistics

GET /api/admin/documents/stats?companyId={companyId}

Response:

{
  "companyId": 1,
  "totalDocuments": 150,
  "statusCounts": {
    "COMPLETED": 140,
    "FAILED": 5,
    "PENDING": 3,
    "PROCESSING": 2
  },
  "totalCharactersExtracted": 2456789,
  "indexedDocuments": 140
}

List Documents

GET /api/admin/documents?companyId={companyId}&status={status}

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID |
| status | string | No | Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED |
| page | int | No | Page number |
| size | int | No | Page size |

Get Document Content

GET /api/admin/documents/{fileId}?includeFullText=true

Manually Process File

Trigger OCR processing for a specific file.

POST /api/admin/documents/{fileId}/process?companyId={companyId}

Response:

{
  "message": "OCR processing started for file 123",
  "status": "processing"
}

Reprocess Failed File

POST /api/admin/documents/{fileId}/reprocess

Reprocess All Failed Documents

POST /api/admin/documents/reprocess-failed?companyId={companyId}

Response:

{
  "message": "Reprocessing started for failed documents in company 1",
  "status": "processing"
}

Delete Document Content

DELETE /api/admin/documents/{fileId}

Company File Uploads (Multi-Tenancy)

Upload files to company-specific folders in MinIO for tenant isolation.

Storage Pattern: companies/{companyId}/{subFolder}/{hash}{ext}
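
As an illustration of the storage pattern, the sketch below builds an object key from its parts. The function name is hypothetical, and dropping the sub-folder segment when no folder is given is an assumption.

```python
def company_object_key(company_id: int, file_hash: str, ext: str, sub_folder: str = "") -> str:
    """Build a key of the form companies/{companyId}/{subFolder}/{hash}{ext}."""
    parts = ["companies", str(company_id)]
    folder = sub_folder.strip("/")
    if ".." in folder.split("/"):
        # Refuse folders that would escape the company prefix.
        raise ValueError("sub_folder must not contain '..'")
    if folder:
        parts.append(folder)
    if ext and not ext.startswith("."):
        ext = "." + ext
    parts.append(f"{file_hash}{ext}")
    return "/".join(parts)


print(company_object_key(1, "invoice_abc123def4", ".pdf", "documents"))
```

Keeping every key under the `companies/{companyId}` prefix is what gives each company its own folder in the bucket.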

Upload File for Company

POST /api/upload/company/{companyId}?folder={subFolder}
Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID (path) |
| files | file | Yes | File to upload |
| folder | string | No | Sub-folder (e.g., "documents", "images") |

Example:

curl -X POST "http://localhost:1337/api/upload/company/1?folder=documents" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@invoice.pdf"

Response:

[
  {
    "id": 123,
    "name": "invoice.pdf",
    "hash": "invoice_abc123def4",
    "ext": ".pdf",
    "mime": "application/pdf",
    "size": 256.5,
    "url": "http://minio:9000/tgm-uploads/companies/1/documents/invoice_abc123def4.pdf",
    "provider": "minio",
    "folderPath": "/companies/1/documents"
  }
]

Upload Multiple Files for Company

POST /api/upload/company/{companyId}/multiple?folder={subFolder}
Content-Type: multipart/form-data

Upload and Link File to Entity

POST /api/upload/company/{companyId}/link
Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID (path) |
| files | file | Yes | File to upload |
| folder | string | No | Sub-folder |
| ref | string | Yes | Entity type (e.g., api::article.article) |
| refId | long | Yes | Entity ID |
| field | string | Yes | Field name (e.g., attachments) |


Database Schema

Company Table Additions

ALTER TABLE companies
    ADD COLUMN ocr_enabled BOOLEAN DEFAULT FALSE,
    ADD COLUMN ocr_language VARCHAR(10) DEFAULT 'en',
    ADD COLUMN ocr_auto_process BOOLEAN DEFAULT TRUE,
    ADD COLUMN ocr_file_types TEXT DEFAULT 'pdf,png,jpg,jpeg,tiff,docx,doc';

Document Contents Table

CREATE TABLE document_contents (
    id BIGSERIAL PRIMARY KEY,
    file_id BIGINT NOT NULL,
    file_type VARCHAR(20) DEFAULT 'strapi',
    extracted_text TEXT,
    text_length INTEGER,
    ocr_confidence DECIMAL(5,4),
    ocr_language VARCHAR(10),
    page_count INTEGER DEFAULT 1,
    word_count INTEGER,
    processing_status VARCHAR(20) DEFAULT 'pending',
    error_message TEXT,
    processing_started_at TIMESTAMP,
    processing_duration_ms INTEGER,
    indexed_for_search BOOLEAN DEFAULT FALSE,
    search_embedding_id BIGINT,
    company_id BIGINT NOT NULL REFERENCES companies(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP
);

-- Full-text search index
CREATE INDEX idx_document_contents_text_search
    ON document_contents USING gin(to_tsvector('english', COALESCE(extracted_text, '')));
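
For PostgreSQL to use the GIN index above, a query generally has to repeat the indexed expression. The sketch below builds such a query as a parameterized SQL string. It is an assumption about how a search could be written against this schema, not the query the application actually runs; the function name and the `%s` placeholder style (as used by common Python PostgreSQL drivers) are illustrative.

```python
def build_search_query(company_id: int, query: str, page: int = 0, size: int = 20):
    """Return (sql, params) for a company-scoped full-text search."""
    sql = (
        "SELECT id, file_id, LEFT(extracted_text, 500) AS text_preview "
        "FROM document_contents "
        "WHERE company_id = %s "
        # Same expression as the index definition, so the planner can match it.
        "AND to_tsvector('english', COALESCE(extracted_text, '')) "
        "@@ plainto_tsquery('english', %s) "
        "ORDER BY id LIMIT %s OFFSET %s"
    )
    return sql, (company_id, query, size, page * size)


sql, params = build_search_query(1, "invoice", page=1)
print(params)
```

Passing the search text as a parameter keeps user input out of the SQL string itself.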

Processing Status Values

| Status | Description |
|--------|-------------|
| PENDING | Queued for processing |
| PROCESSING | Currently being processed |
| COMPLETED | Successfully processed |
| FAILED | Processing failed (see error_message) |
| SKIPPED | Skipped (unsupported type or OCR disabled) |
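
The status values can be modeled as an enum with a few helpers. This is a sketch only. Treating FAILED and SKIPPED as retryable follows the list-view description elsewhere in this document, treating PENDING and PROCESSING as "in progress" is an assumption, and the case-insensitive `parse` helper is hypothetical (it accommodates the lower-case 'pending' default in the schema).

```python
from enum import Enum


class ProcessingStatus(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"

    @classmethod
    def parse(cls, value: str) -> "ProcessingStatus":
        # Accept 'pending' as well as 'PENDING'.
        return cls(value.strip().upper())

    @property
    def complete(self) -> bool:
        return self is ProcessingStatus.COMPLETED

    @property
    def in_progress(self) -> bool:
        return self in (ProcessingStatus.PENDING, ProcessingStatus.PROCESSING)

    @property
    def retryable(self) -> bool:
        return self in (ProcessingStatus.FAILED, ProcessingStatus.SKIPPED)


print(ProcessingStatus.parse("pending"), ProcessingStatus.FAILED.retryable)
```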

Docker Setup

Starting OCR Service

# Start all services including OCR
docker-compose --profile ocr up -d

# Or start only OCR service
docker-compose --profile ocr up -d ocr-service

Using Alternative Docker Images

# Use bvdcode/paddleocrapi
OCR_IMAGE=bvdcode/paddleocrapi:latest docker-compose --profile ocr up -d

# Use m986883511/paddleocr:api (GPU support)
OCR_IMAGE=m986883511/paddleocr:api docker-compose --profile ocr up -d

Docker Compose Configuration

ocr-service:
  image: ${OCR_IMAGE:-blazordevlab/paddleocrapi:latest}
  container_name: tgm-ocr-service
  environment:
    - LANG=${OCR_LANGUAGE:-ch}
    - USE_GPU=${OCR_USE_GPU:-false}
  ports:
    - "8000:8000"
  networks:
    - tgm-network
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/"]
    interval: 30s
    timeout: 10s
    start_period: 120s
    retries: 3
  profiles:
    - ocr

GPU Support

For GPU acceleration, uncomment the deploy section in docker-compose.yml:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

Usage Examples

Enable OCR for a Company

# Enable OCR
curl -X POST "http://localhost:1337/api/companies/1/ocr-config/enable" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Configure OCR settings
curl -X PUT "http://localhost:1337/api/companies/1/ocr-config" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "ocrEnabled": true,
    "ocrLanguage": "en",
    "ocrAutoProcess": true,
    "ocrFileTypes": "pdf,png,jpg,docx"
  }'

Upload Document with OCR

# Upload to company folder (OCR auto-triggered if enabled)
curl -X POST "http://localhost:1337/api/upload/company/1?folder=invoices" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@invoice.pdf"

Search Document Content

# Search for documents containing "invoice"
curl "http://localhost:1337/search/documents?companyId=1&q=invoice" \
  -H "Authorization: Bearer $TOKEN"

Manually Trigger OCR

# Process a specific file
curl -X POST "http://localhost:1337/api/admin/documents/123/process?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Reprocess all failed documents
curl -X POST "http://localhost:1337/api/admin/documents/reprocess-failed?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Check OCR Status

# Check service health
curl "http://localhost:1337/api/companies/ocr-status" \
  -H "Authorization: Bearer $TOKEN"

# Get processing statistics
curl "http://localhost:1337/api/admin/documents/stats?companyId=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

View Document Analysis (Company Admin)

# Get analysis summary for dashboard
curl "http://localhost:1337/api/companies/1/document-analysis/summary" \
  -H "Authorization: Bearer $TOKEN"

# List all analyses
curl "http://localhost:1337/api/companies/1/document-analysis" \
  -H "Authorization: Bearer $TOKEN"

# List only failed analyses
curl "http://localhost:1337/api/companies/1/document-analysis?status=FAILED" \
  -H "Authorization: Bearer $TOKEN"

# Get detail for specific analysis
curl "http://localhost:1337/api/companies/1/document-analysis/123?includeFullText=true" \
  -H "Authorization: Bearer $TOKEN"

# Retry a failed analysis
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/123/retry" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Retry all failed analyses
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/retry-all-failed" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Manually trigger analysis for a file
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/analyze/456" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Frontend Integration

Document Analysis Sidebar Menu

The Document Analysis feature is designed to be displayed as a sidebar menu item in Organization Settings. Here's the recommended UI structure:

Sidebar Menu Entry:

Organization Settings
├── General
├── Members
├── Document Analysis  <-- New entry
└── ...

Dashboard View (/org/:companyId/document-analysis):

Display the summary statistics from GET /document-analysis/summary:

  • Total documents processed
  • Success rate percentage
  • Status breakdown (completed, failed, pending, processing)
  • OCR enabled/disabled status
  • Service availability indicator

List View (/org/:companyId/document-analysis/list):

Display paginated table from GET /document-analysis:

| Column | Description |
|--------|-------------|
| File Name | Link to file preview |
| Status | Badge with color (green=completed, red=failed, yellow=pending) |
| Words | Word count extracted |
| Confidence | OCR confidence percentage |
| Processed | When processing completed |
| Actions | Retry button (for failed/skipped) |

Filters:

  • Status dropdown (All, Completed, Failed, Pending, Processing, Skipped)
  • Date range picker

Detail View (/org/:companyId/document-analysis/:id):

Show full analysis details including:

  • File preview/download link
  • Processing timeline
  • Extracted text (with copy button)
  • OCR confidence score
  • Error message (if failed)
  • Retry button

Status Badge Colors

| Status | Color |
|--------|-------|
| COMPLETED | Green |
| FAILED | Red |
| PENDING | Yellow |
| PROCESSING | Blue |
| SKIPPED | Gray |

Troubleshooting

OCR Service Not Available

  1. Check if OCR is enabled globally: OCR_ENABLED=true
  2. Verify OCR service is running: docker-compose --profile ocr ps
  3. Check service health: curl http://localhost:8000/

Processing Failures

  1. Check document status via admin API
  2. Review error messages in document_contents.error_message
  3. Check OCR service logs: docker-compose logs ocr-service

Slow Processing

  1. Consider enabling GPU: OCR_USE_GPU=true
  2. Increase timeout: OCR_TIMEOUT_SECONDS=300
  3. Check file size limits in OCR service configuration

Testing

The Document Analysis feature includes comprehensive test coverage:

Unit Tests

DTO Tests (DocumentAnalysisDtoTest.java) - 32 tests

  • DocumentAnalysisSummaryDto: Builder pattern, getters/setters, success rate calculation
  • DocumentAnalysisItemDto: Confidence percent calculation, processing duration, retryable status
  • DocumentAnalysisDetailDto: File size conversion, status helpers (isComplete, isInProgress, isRetryable)
  • Edge cases: Zero values, null handling, large numbers

Controller Unit Tests (DocumentAnalysisControllerTest.java) - 22 tests

  • GET /summary: Returns correct summary with mocked service
  • GET /: Pagination, status filtering, file info mapping
  • GET /{analysisId}: Detail retrieval with/without full text
  • POST /{analysisId}/retry: Authorization, status validation
  • POST /retry-all-failed: Batch retry with count
  • POST /analyze/{fileId}: File validation, OCR checks

Integration Tests

Controller Integration Tests (DocumentAnalysisControllerIntegrationTest.java) - 28 tests

Uses Testcontainers for real database testing:

| Test Category | Tests | Description |
|---------------|-------|-------------|
| GetSummary | 3 | Summary stats, auth, 404 handling |
| ListAnalyses | 5 | Pagination, COMPLETED/FAILED/PENDING filters, file info |
| GetAnalysisDetail | 4 | Detail with/without text, company isolation |
| RetryAnalysis | 4 | Admin retry, regular user 403, status validation |
| RetryAllFailed | 3 | Batch retry, zero count, authorization |
| AnalyzeFile | 4 | OCR disabled, service unavailable, authorization |
| DataIntegrity | 2 | Company isolation, database consistency |
| EdgeCases | 3 | Empty company, pagination edge cases |

Running Tests

# Run all tests
mvn test

# Run only Document Analysis tests
mvn test -Dtest="DocumentAnalysis*"

# Run only integration tests
mvn test -Dtest="DocumentAnalysisControllerIntegrationTest"

# Run only DTO tests
mvn test -Dtest="DocumentAnalysisDtoTest"

Test Data Setup

Integration tests create:

  • Test company with OCR enabled
  • Regular user (Authenticated role)
  • Admin user (Admin role)
  • Sample files (PDF) in different states:
      • COMPLETED: With extracted text, confidence, word count
      • FAILED: With error message
      • PENDING: Awaiting processing


Production Readiness

Implemented Features

| Feature | Notes |
|---------|-------|
| Authorization | Role-based: Authenticated (view), ADMIN/MANAGER (actions) |
| Multi-tenancy | Company isolation via company_id foreign key |
| Error Handling | Meaningful messages for all failure scenarios |
| Async Processing | @Async("ocrExecutor") for non-blocking OCR |
| Pagination | Spring Data pagination with status filtering + max page size (100) |
| Rate Limiting | Redis-based rate limiting on retry/analyze endpoints |
| Metrics & Monitoring | Micrometer metrics for Prometheus/Actuator |
| API Documentation | OpenAPI annotations on all endpoints |
| Logging | Structured logging with SLF4J |
| Test Coverage | 117+ tests (DTO + unit + integration) |

Security Considerations

| Aspect | Details |
|--------|---------|
| Authentication | JWT required for all endpoints |
| Authorization | @PreAuthorize on all endpoints |
| Company Isolation | Documents filtered by companyId parameter |
| Input Validation | Status enum validated, max page size enforced |
| Rate Limiting | Configurable per-user and per-company limits |

Rate Limiting Configuration

Rate limits are configurable via application.yml:

app:
  ocr:
    rate-limit:
      retry:
        max-requests: 10        # Max retries per user per minute
        window-seconds: 60
      analyze:
        max-requests: 20        # Max analyze requests per user per minute
        window-seconds: 60
      retry-all:
        max-requests: 3         # Max batch retries per company per 5 minutes
        window-seconds: 300
    pagination:
      max-page-size: 100        # Max items per page

Monitoring & Metrics

OCR processing exposes Prometheus metrics via Spring Actuator:

Counters:

| Metric | Description |
|--------|-------------|
| ocr.documents.processed.total | Total documents submitted for processing |
| ocr.documents.success.total | Successful OCR extractions |
| ocr.documents.failed.total | Failed OCR extractions |
| ocr.retry.total | Manual retry requests |
| ocr.retry.all.total | Batch retry requests |

Gauges:

| Metric | Description |
|--------|-------------|
| ocr.queue.pending | Documents awaiting processing |
| ocr.queue.processing | Documents currently being processed |
| ocr.queue.failed | Documents with failed status |

Timer:

| Metric | Description |
|--------|-------------|
| ocr.processing.duration | Processing time (p50, p75, p95, p99) |

Queue Status Endpoint:

GET /api/companies/{companyId}/document-analysis/queue-status

Returns:

{
  "data": {
    "pending": 5,
    "processing": 2,
    "completed": 150,
    "failed": 3,
    "skipped": 1,
    "total": 161
  }
}

Health Endpoint:

GET /actuator/health

Returns comprehensive health status including OCR service (when enabled):

{
  "status": "UP",
  "components": {
    "application": {
      "status": "UP",
      "details": {
        "name": "TGM Manager Server",
        "version": "1.1.0",
        "features": {
          "ocr": true,
          "influxDb": false,
          "semanticSearch": false,
          "email": true,
          "sms": false,
          "sso": true,
          "license": true,
          "rabbitmq": true,
          "ragflow": false,
          "cron": true
        }
      }
    },
    "db": { "status": "UP" },
    "minio": { "status": "UP", "details": { "bucket": "tgm-uploads" } },
    "rabbit": { "status": "UP", "details": { "version": "3.12.14" } },
    "redis": { "status": "UP", "details": { "version": "8.4.0" } },
    "ocr": { "status": "UP", "details": { "serviceUrl": "http://localhost:8000" } },
    "influxdb": { "status": "UP", "details": { "org": "ensolutions" } }
  }
}
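
A monitoring script could scan the health response for components that are not UP. The sketch below works on the already-parsed JSON body; the function name is hypothetical, and it assumes every component carries a `status` field as in the sample.

```python
def unhealthy_components(health: dict) -> list:
    """Return the sorted names of components whose status is not 'UP'."""
    components = health.get("components", {})
    return sorted(name for name, info in components.items()
                  if info.get("status") != "UP")


sample = {
    "status": "DOWN",
    "components": {
        "db": {"status": "UP"},
        "ocr": {"status": "DOWN", "details": {"serviceUrl": "http://localhost:8000"}},
    },
}
print(unhealthy_components(sample))
```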

Performance

| Aspect | Implementation |
|--------|----------------|
| Async OCR | Dedicated ocrExecutor thread pool |
| Database Indexes | Full-text search index on extracted_text |
| Lazy Loading | File info fetched separately per document |
| Pagination | Server-side with enforced max size (100) |
| Rate Limiting | Redis-based via CacheService |

Alerting Recommendations

Use the metrics to set up alerts:

# Prometheus alerting rules
groups:
  - name: ocr-alerts
    rules:
      - alert: OcrQueueBacklogHigh
        expr: ocr_queue_pending > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog is high ({{ $value }} pending)"

      - alert: OcrHighFailureRate
        expr: rate(ocr_documents_failed_total[5m]) / rate(ocr_documents_processed_total[5m]) > 0.3
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "OCR failure rate exceeds 30%"