OCR Document Analysis
This document describes the OCR (Optical Character Recognition) integration for extracting text from uploaded documents using PaddleOCR.
Table of Contents
- Overview
- Multi-Tenancy Context
- Architecture
- Configuration
- API Endpoints
  - OCR Configuration
  - Document Analysis (Company Admin)
  - Document Search
  - Document Admin
  - Company File Uploads
- Database Schema
- Docker Setup
- Usage Examples
- Frontend Integration
- Troubleshooting
- Testing
- Production Readiness
Overview
The OCR integration enables automatic text extraction from uploaded documents (PDFs, images, Word documents). Key features:
- Per-company OCR configuration - Enable/disable OCR per company
- Automatic processing - OCR runs automatically on file upload when enabled
- Full-text search - Search within extracted document content
- Multi-tenancy support - Client-isolated databases with company-isolated file storage
Multi-Tenancy Context
OCR configuration and processing operate within TGM's multi-tenant architecture:
┌─────────────────────────────────────────────────────────────────┐
│                         Platform Level                          │
│  OCR Service (PaddleOCR Docker) - Shared across all clients     │
│  Global OCR Enable/Disable (app.ocr.enabled)                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
  ┌───────────────┐      ┌───────────────┐      ┌───────────────┐
  │   Client A    │      │   Client B    │      │   Client C    │
  │   Database    │      │   Database    │      │   Database    │
  ├───────────────┤      ├───────────────┤      ├───────────────┤
  │ Company       │      │ Company       │      │ Company       │
  │ ocrEnabled:   │      │ ocrEnabled:   │      │ ocrEnabled:   │
  │   true        │      │   false       │      │   true        │
  │ ocrLanguage:  │      │               │      │ ocrLanguage:  │
  │   'en'        │      │               │      │   'fr'        │
  ├───────────────┤      └───────────────┘      ├───────────────┤
  │ document_     │                             │ document_     │
  │ contents      │                             │ contents      │
  │ (table)       │                             │ (table)       │
  └───────────────┘                             └───────────────┘
Key Points
- OCR Service is shared across all clients (single Docker container)
- OCR Configuration is per-company within each client's database
- Document Contents are stored in each client's isolated database
- Files are stored in company-specific folders in storage (MinIO/S3)
API Request Headers
All OCR API requests require multi-tenant headers:
curl -X GET "http://localhost:1337/api/companies/1/ocr-config" \
-H "Authorization: Bearer $JWT" \
-H "X-Client-ID: acme-corp"
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | JWT or API token |
| X-Client-ID | Yes* | Client identifier |
| X-Tenant-ID | No | Sandbox (omit for production) |
*Can be resolved via subdomain: acme-corp.tgm-expert.com
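A client can assemble these headers in one place. The sketch below is illustrative only: build_tenant_headers is a hypothetical helper, not part of the TGM codebase, and it assumes that X-Client-ID may be left out when the client is resolved from the subdomain.

```python
from typing import Dict, Optional


def build_tenant_headers(
    token: str,
    client_id: Optional[str] = None,
    tenant_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build the multi-tenant request headers listed in the table above.

    client_id may be omitted only when the client is resolved from the
    subdomain; tenant_id is sent only for sandbox requests.
    """
    if not token:
        raise ValueError("a JWT or API token is required")
    headers = {"Authorization": f"Bearer {token}"}
    if client_id:
        headers["X-Client-ID"] = client_id
    if tenant_id:
        headers["X-Tenant-ID"] = tenant_id
    return headers
```

Passing the returned dictionary to any HTTP client keeps the tenant context consistent across calls.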
Supported File Types
| Type | Extensions |
|---|---|
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP |
| Documents | PDF, DOCX, DOC |
Supported Languages
| Code | Language |
|---|---|
| ch | Chinese (Simplified) |
| en | English |
| fr | French |
| german | German |
| korean | Korean |
| japan | Japanese |
Architecture
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│   Spring Boot   │────▶│    PaddleOCR    │
│   (Frontend)    │     │     (Java)      │     │    (Docker)     │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
               ┌──────────┐           ┌───────────────┐
               │PostgreSQL│           │     MinIO     │
               │(content) │           │ (file storage)│
               └──────────┘           └───────────────┘
Flow:
1. User uploads file via API
2. File stored in MinIO under company-specific folder
3. If OCR is enabled globally and for the company, and ocrAutoProcess is on, async processing is triggered
4. PaddleOCR extracts text from document
5. Extracted text stored in document_contents table
6. Text indexed for full-text search
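The decision made in step 3 can be sketched as a small function. This is a minimal illustration under stated assumptions, not the server's implementation: CompanyOcrConfig and initial_status are hypothetical names, extension matching is assumed to be case-insensitive, and returning None when ocrAutoProcess is off is a guess at the behaviour (the file would then wait for a manual analysis request).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class CompanyOcrConfig:
    # Mirrors the per-company fields documented under "OCR Configuration".
    ocr_enabled: bool = False
    ocr_auto_process: bool = True
    ocr_file_types: str = "pdf,png,jpg,jpeg,tiff,docx,doc"


def initial_status(global_enabled: bool, config: CompanyOcrConfig,
                   file_name: str) -> Optional[str]:
    """Return the status an uploaded file would start with.

    None means nothing is queued automatically (assumed behaviour when
    ocrAutoProcess is false).
    """
    if not global_enabled or not config.ocr_enabled:
        return "SKIPPED"  # OCR disabled globally or for the company
    allowed = {t.strip().lower() for t in config.ocr_file_types.split(",") if t.strip()}
    extension = PurePosixPath(file_name).suffix.lstrip(".").lower()
    if extension not in allowed:
        return "SKIPPED"  # unsupported file type
    if not config.ocr_auto_process:
        return None  # assumption: left for manual analysis
    return "PENDING"  # queued for asynchronous OCR
```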
Configuration
Application Properties
app:
  ocr:
    # Enable/disable OCR globally
    enabled: ${OCR_ENABLED:false}
    # OCR service URL
    service-url: ${OCR_SERVICE_URL:http://localhost:8000}
    # OCR API endpoint path
    endpoint: ${OCR_ENDPOINT:/ocr}
    # Request timeout in seconds
    timeout-seconds: ${OCR_TIMEOUT_SECONDS:120}
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OCR_ENABLED | false | Enable OCR globally |
| OCR_SERVICE_URL | http://localhost:8000 | PaddleOCR service URL |
| OCR_ENDPOINT | /ocr | OCR API endpoint path |
| OCR_TIMEOUT_SECONDS | 120 | Request timeout in seconds |
| OCR_IMAGE | blazordevlab/paddleocrapi:latest | Docker image to use |
| OCR_LANGUAGE | ch | Default OCR language |
| OCR_USE_GPU | false | Enable GPU acceleration |
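To make the defaults concrete, the following sketch resolves the four application-level variables the way the ${VAR:default} placeholders do. OcrSettings and load_ocr_settings are hypothetical names, and joining service-url with endpoint to form the request URL is an assumption about how the server calls PaddleOCR.

```python
import os
from typing import Mapping, NamedTuple


class OcrSettings(NamedTuple):
    enabled: bool
    service_url: str
    endpoint: str
    timeout_seconds: int

    @property
    def request_url(self) -> str:
        # Join base URL and endpoint path without doubling the slash.
        return self.service_url.rstrip("/") + "/" + self.endpoint.lstrip("/")


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Resolve settings using the defaults from the table above."""
    return OcrSettings(
        enabled=env.get("OCR_ENABLED", "false").strip().lower() == "true",
        service_url=env.get("OCR_SERVICE_URL", "http://localhost:8000"),
        endpoint=env.get("OCR_ENDPOINT", "/ocr"),
        timeout_seconds=int(env.get("OCR_TIMEOUT_SECONDS", "120")),
    )
```

With no variables set, the resolved request URL is http://localhost:8000/ocr and OCR stays disabled.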
API Endpoints
OCR Configuration
Manage OCR settings per company.
Get OCR Configuration
GET /api/companies/{companyId}/ocr-config
Response:
{
"data": {
"companyId": 1,
"companyName": "Acme Corp",
"ocrEnabled": true,
"ocrLanguage": "en",
"ocrAutoProcess": true,
"ocrFileTypes": "pdf,png,jpg,jpeg,tiff,docx,doc",
"supportedFileTypes": ["pdf", "png", "jpg", "jpeg", "tiff", "docx", "doc"]
}
}
Update OCR Configuration
PUT /api/companies/{companyId}/ocr-config
Authorization: Bearer {token}
Content-Type: application/json
Request Body:
{
"ocrEnabled": true,
"ocrLanguage": "en",
"ocrAutoProcess": true,
"ocrFileTypes": "pdf,png,jpg"
}
| Field | Type | Description |
|---|---|---|
| ocrEnabled | boolean | Enable/disable OCR for company |
| ocrLanguage | string | Default language code (2-5 chars) |
| ocrAutoProcess | boolean | Auto-process uploads with OCR |
| ocrFileTypes | string | Comma-separated file extensions |
Response:
{
"data": {
"companyId": 1,
"companyName": "Acme Corp",
"ocrEnabled": true,
"ocrLanguage": "en",
"ocrAutoProcess": true,
"ocrFileTypes": "pdf,png,jpg",
"supportedFileTypes": ["pdf", "png", "jpg"]
},
"message": "OCR configuration updated successfully"
}
Enable OCR
POST /api/companies/{companyId}/ocr-config/enable
Authorization: Bearer {token}
Required Role: ADMIN or SUPER_ADMIN
Disable OCR
POST /api/companies/{companyId}/ocr-config/disable
Authorization: Bearer {token}
Required Role: ADMIN or SUPER_ADMIN
Get OCR Service Status
Check if the OCR service is available.
GET /api/companies/ocr-status
Response:
{
"data": {
"serviceAvailable": true,
"serviceUrl": "http://ocr-service:8000",
"serviceVersion": "PaddleOCR",
"supportedLanguages": [
{"code": "ch", "name": "Chinese"},
{"code": "en", "name": "English"},
{"code": "fr", "name": "French"}
],
"defaultLanguage": "ch"
}
}
Document Analysis (Company Admin)
View and manage OCR document analyses for your organization. These endpoints are designed for company administrators to monitor OCR processing status through a sidebar menu in Organization Settings.
Base URL: /api/companies/{companyId}/document-analysis
Required Role: Authenticated user (company member)
Get Analysis Summary
Get dashboard statistics for document analysis in your organization.
GET /api/companies/{companyId}/document-analysis/summary
Authorization: Bearer {token}
Response:
{
"data": {
"companyId": 1,
"companyName": "Acme Corp",
"ocrEnabled": true,
"ocrLanguage": "en",
"ocrAutoProcess": true,
"totalDocuments": 150,
"completedCount": 140,
"failedCount": 5,
"pendingCount": 3,
"processingCount": 2,
"skippedCount": 0,
"totalCharactersExtracted": 2456789,
"serviceAvailable": true,
"successRate": 96.55
}
}
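The sample values are consistent with successRate being computed over finished documents only (completed plus failed), since 140 / (140 + 5) gives 96.55. That formula is inferred from the sample response rather than confirmed by the implementation; a minimal sketch under that assumption:

```python
def success_rate(completed: int, failed: int) -> float:
    """Percentage of finished analyses that succeeded, rounded to 2 decimals."""
    finished = completed + failed
    if finished == 0:
        return 0.0  # assumed value when nothing has finished yet
    return round(100.0 * completed / finished, 2)
```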
List Document Analyses
Get a paginated list of all document analyses with an optional status filter.
GET /api/companies/{companyId}/document-analysis?status={status}
Authorization: Bearer {token}
Parameters:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| status | string | No | Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED |
| page | int | No | Page number (default: 0) |
| size | int | No | Page size (default: 20) |
| sort | string | No | Sort field (e.g., createdAt,desc) |
Response:
{
"content": [
{
"id": 1,
"fileId": 123,
"fileName": "invoice.pdf",
"fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
"mimeType": "application/pdf",
"status": "COMPLETED",
"statusLabel": "Completed",
"errorMessage": null,
"textLength": 5432,
"wordCount": 876,
"pageCount": 3,
"ocrConfidence": 0.9523,
"ocrLanguage": "en",
"processingDurationMs": 15234,
"createdAt": "2025-02-05T10:30:00",
"processedAt": "2025-02-05T10:30:15",
"textPreview": "INVOICE #12345..."
}
],
"totalElements": 150,
"totalPages": 8,
"number": 0,
"size": 20
}
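Because the response reports totalPages alongside a zero-based page number, a client can walk the whole result set page by page. The sketch below is one way to do that; iter_analyses and the caller-supplied fetch_page function (which would perform the GET request with ?page=n and return the decoded JSON body) are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def iter_analyses(fetch_page: Callable[[int], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every analysis across all pages, starting at page 0."""
    page_number = 0
    while True:
        page = fetch_page(page_number)
        yield from page["content"]
        page_number += 1
        # Stop once the last page reported by the server has been read.
        if page_number >= page["totalPages"]:
            break
```

Keeping the HTTP call outside the generator makes it easy to add authentication headers or retries without touching the paging logic.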
Get Analysis Detail
Get full details of a specific document analysis.
GET /api/companies/{companyId}/document-analysis/{analysisId}?includeFullText=false
Authorization: Bearer {token}
Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| includeFullText | boolean | false | Include full extracted text in response |
Response:
{
"data": {
"id": 1,
"fileId": 123,
"fileName": "invoice.pdf",
"fileUrl": "http://minio:9000/tgm-uploads/companies/1/documents/invoice.pdf",
"mimeType": "application/pdf",
"fileSizeBytes": 262144,
"status": "COMPLETED",
"statusLabel": "Completed",
"errorMessage": null,
"extractedText": "Full text here...",
"textPreview": "INVOICE #12345...",
"textLength": 5432,
"wordCount": 876,
"pageCount": 3,
"ocrConfidence": 0.9523,
"ocrLanguage": "en",
"processingDurationMs": 15234,
"processingStartedAt": "2025-02-05T10:30:00",
"createdAt": "2025-02-05T10:29:55",
"processedAt": "2025-02-05T10:30:15",
"indexedForSearch": true,
"confidencePercent": 95,
"processingDurationSeconds": 15.234,
"fileSizeKb": 256.0,
"retryable": false,
"complete": true,
"inProgress": false
}
}
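The convenience fields at the end of the response can be derived from the raw values. The sketch below reproduces the sample numbers; derived_fields is a hypothetical helper, and rounding the confidence to the nearest whole percent and using 1024 bytes per KB are assumptions that happen to match the sample.

```python
from typing import Any, Dict, Optional


def derived_fields(analysis: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Recompute confidencePercent, processingDurationSeconds and fileSizeKb."""
    confidence = analysis.get("ocrConfidence")
    duration_ms = analysis.get("processingDurationMs")
    size_bytes = analysis.get("fileSizeBytes")
    return {
        # Raw values may be null, for example on failed analyses.
        "confidencePercent": None if confidence is None else round(confidence * 100),
        "processingDurationSeconds": None if duration_ms is None else duration_ms / 1000,
        "fileSizeKb": None if size_bytes is None else size_bytes / 1024,
    }
```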
Retry Failed Analysis
Retry OCR processing for a failed document.
POST /api/companies/{companyId}/document-analysis/{analysisId}/retry
Authorization: Bearer {token}
Required Role: ADMIN or MANAGER
Rate Limit: 10 requests per user per 60 seconds
Response:
{
"data": {
"message": "Analysis retry started",
"analysisId": 1,
"status": "processing"
},
"message": "OCR processing has been queued"
}
Error Response (OCR disabled):
{
"error": "OCR is disabled for this organization. Enable it in settings first."
}
Retry All Failed Analyses
Queue all failed documents for reprocessing.
POST /api/companies/{companyId}/document-analysis/retry-all-failed
Authorization: Bearer {token}
Required Role: ADMIN or MANAGER
Rate Limit: 3 requests per company per 5 minutes
Response:
{
"data": {
"message": "Retry started for all failed analyses",
"count": 5,
"status": "processing"
},
"message": "5 documents queued for reprocessing"
}
Get Queue Status
Get current OCR queue status for monitoring.
GET /api/companies/{companyId}/document-analysis/queue-status
Authorization: Bearer {token}
Required Role: ADMIN or MANAGER
Response:
{
"data": {
"pending": 5,
"processing": 2,
"completed": 150,
"failed": 3,
"skipped": 1,
"total": 161
}
}
Manually Analyze a File
Trigger OCR analysis for a specific file.
POST /api/companies/{companyId}/document-analysis/analyze/{fileId}
Authorization: Bearer {token}
Required Role: ADMIN or MANAGER
Rate Limit: 20 requests per user per 60 seconds
Response (success):
{
"data": {
"message": "OCR analysis started",
"fileId": 123,
"fileName": "contract.pdf",
"status": "processing"
},
"message": "File queued for OCR processing"
}
Error Responses:
- OCR disabled: "OCR is disabled for this organization. Enable it in settings first."
- Service unavailable: "OCR service is not available. Please try again later."
- Unsupported type: "File type not supported for OCR: video.mp4"
- Already processed: "File has already been analyzed successfully"
Document Search
Search within OCR-extracted document content.
Search Documents
GET /search/documents?companyId={companyId}&q={query}
Parameters:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID |
| q | string | Yes | Search query |
| page | int | No | Page number (default: 0) |
| size | int | No | Page size (default: 20) |
Response:
{
"content": [
{
"id": 1,
"fileId": 123,
"fileType": "strapi",
"textLength": 5432,
"wordCount": 876,
"ocrConfidence": 0.9523,
"ocrLanguage": "en",
"pageCount": 3,
"processingStatus": "COMPLETED",
"indexedForSearch": true,
"companyId": 1,
"createdAt": "2025-02-05T10:30:00",
"processedAt": "2025-02-05T10:30:15",
"processingDurationMs": 15234,
"textPreview": "This is the first 500 characters of the extracted text..."
}
],
"totalElements": 25,
"totalPages": 2,
"number": 0,
"size": 20
}
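Search terms often contain spaces or reserved characters, so the query string should be URL-encoded rather than concatenated by hand. A minimal sketch using the standard library; search_url is a hypothetical helper and the base URL is whatever host serves the API.

```python
from urllib.parse import urlencode


def search_url(base_url: str, company_id: int, query: str,
               page: int = 0, size: int = 20) -> str:
    """Build the GET /search/documents URL with an encoded query string."""
    params = urlencode({"companyId": company_id, "q": query, "page": page, "size": size})
    return f"{base_url.rstrip('/')}/search/documents?{params}"
```

For example, searching for "invoice #12345" encodes the space as + and the # as %23, so the fragment marker does not cut the query short.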
Get Document Content
GET /search/documents/{fileId}?includeFullText=false
Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| includeFullText | boolean | false | Include full extracted text |
Document Admin
Administrative endpoints for managing OCR processing.
Base URL: /api/admin/documents
Required Role: ADMIN or SUPER_ADMIN
Get OCR Service Status
GET /api/admin/documents/ocr-status
Get OCR Statistics
GET /api/admin/documents/stats?companyId={companyId}
Response:
{
"companyId": 1,
"totalDocuments": 150,
"statusCounts": {
"COMPLETED": 140,
"FAILED": 5,
"PENDING": 3,
"PROCESSING": 2
},
"totalCharactersExtracted": 2456789,
"indexedDocuments": 140
}
List Documents
GET /api/admin/documents?companyId={companyId}&status={status}
Parameters:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID |
| status | string | No | Filter by status: PENDING, PROCESSING, COMPLETED, FAILED, SKIPPED |
| page | int | No | Page number |
| size | int | No | Page size |
Get Document Content
GET /api/admin/documents/{fileId}?includeFullText=true
Manually Process File
Trigger OCR processing for a specific file.
POST /api/admin/documents/{fileId}/process?companyId={companyId}
Response:
{
"message": "OCR processing started for file 123",
"status": "processing"
}
Reprocess Failed File
POST /api/admin/documents/{fileId}/reprocess
Reprocess All Failed Documents
POST /api/admin/documents/reprocess-failed?companyId={companyId}
Response:
{
"message": "Reprocessing started for failed documents in company 1",
"status": "processing"
}
Delete Document Content
DELETE /api/admin/documents/{fileId}
Company File Uploads (Multi-Tenancy)
Upload files to company-specific folders in MinIO for tenant isolation.
Storage Pattern: companies/{companyId}/{subFolder}/{hash}{ext}
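The storage pattern can be expressed as a small key builder. This is an illustrative sketch: company_object_key is a hypothetical name, the rejection of ".." segments is an added safety assumption, and the case where no sub-folder is given is left out because the resulting path is not specified here.

```python
def company_object_key(company_id: int, sub_folder: str, file_hash: str, ext: str) -> str:
    """Build the object key companies/{companyId}/{subFolder}/{hash}{ext}."""
    folder = sub_folder.strip("/")
    # Refuse empty folders and path traversal out of the company prefix.
    if not folder or ".." in folder.split("/"):
        raise ValueError("sub_folder must be a plain relative folder name")
    if not ext.startswith("."):
        ext = "." + ext
    return f"companies/{company_id}/{folder}/{file_hash}{ext}"
```

Every key for a company shares the companies/{companyId}/ prefix, which is what keeps one company's files separate from another's inside the shared bucket.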
Upload File for Company
POST /api/upload/company/{companyId}?folder={subFolder}
Content-Type: multipart/form-data
Parameters:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID (path) |
| files | file | Yes | File to upload |
| folder | string | No | Sub-folder (e.g., "documents", "images") |
Example:
curl -X POST "http://localhost:1337/api/upload/company/1?folder=documents" \
-H "Authorization: Bearer $TOKEN" \
-F "files=@invoice.pdf"
Response:
[
{
"id": 123,
"name": "invoice.pdf",
"hash": "invoice_abc123def4",
"ext": ".pdf",
"mime": "application/pdf",
"size": 256.5,
"url": "http://minio:9000/tgm-uploads/companies/1/documents/invoice_abc123def4.pdf",
"provider": "minio",
"folderPath": "/companies/1/documents"
}
]
Upload Multiple Files for Company
POST /api/upload/company/{companyId}/multiple?folder={subFolder}
Content-Type: multipart/form-data
Upload and Link File for Company
POST /api/upload/company/{companyId}/link
Content-Type: multipart/form-data
Parameters:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| companyId | long | Yes | Company ID (path) |
| files | file | Yes | File to upload |
| folder | string | No | Sub-folder |
| ref | string | Yes | Entity type (e.g., api::article.article) |
| refId | long | Yes | Entity ID |
| field | string | Yes | Field name (e.g., attachments) |
Database Schema
Company Table Additions
ALTER TABLE companies
ADD COLUMN ocr_enabled BOOLEAN DEFAULT FALSE,
ADD COLUMN ocr_language VARCHAR(10) DEFAULT 'en',
ADD COLUMN ocr_auto_process BOOLEAN DEFAULT TRUE,
ADD COLUMN ocr_file_types TEXT DEFAULT 'pdf,png,jpg,jpeg,tiff,docx,doc';
Document Contents Table
CREATE TABLE document_contents (
id BIGSERIAL PRIMARY KEY,
file_id BIGINT NOT NULL,
file_type VARCHAR(20) DEFAULT 'strapi',
extracted_text TEXT,
text_length INTEGER,
ocr_confidence DECIMAL(5,4),
ocr_language VARCHAR(10),
page_count INTEGER DEFAULT 1,
word_count INTEGER,
processing_status VARCHAR(20) DEFAULT 'pending',
error_message TEXT,
processing_started_at TIMESTAMP,
processing_duration_ms INTEGER,
indexed_for_search BOOLEAN DEFAULT FALSE,
search_embedding_id BIGINT,
company_id BIGINT NOT NULL REFERENCES companies(id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
processed_at TIMESTAMP
);
-- Full-text search index
CREATE INDEX idx_document_contents_text_search
ON document_contents USING gin(to_tsvector('english', COALESCE(extracted_text, '')));
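PostgreSQL can use an expression index only when the query repeats the indexed expression, so a search query should use the same to_tsvector('english', COALESCE(extracted_text, '')) form. The sketch below builds such a query; it is not the server's actual query. build_search_query is a hypothetical helper, the %s placeholders follow the psycopg parameter style, and the ordering and 500-character preview are illustrative choices.

```python
from typing import Tuple

SEARCH_SQL = """
SELECT id, file_id, LEFT(extracted_text, 500) AS text_preview
FROM document_contents
WHERE company_id = %s
  AND to_tsvector('english', COALESCE(extracted_text, ''))
      @@ plainto_tsquery('english', %s)
ORDER BY processed_at DESC
LIMIT %s OFFSET %s
"""


def build_search_query(company_id: int, query: str, page: int = 0,
                       size: int = 20) -> Tuple[str, tuple]:
    """Return the SQL text and bound parameters for one page of results."""
    size = max(1, min(size, 100))  # mirror the documented max page size
    return SEARCH_SQL, (company_id, query, size, max(page, 0) * size)
```

Filtering on company_id in the same statement keeps results scoped to one company even though all companies of a client share the table.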
Processing Status Values
| Status | Description |
|---|---|
| PENDING | Queued for processing |
| PROCESSING | Currently being processed |
| COMPLETED | Successfully processed |
| FAILED | Processing failed (see error_message) |
| SKIPPED | Skipped (unsupported type or OCR disabled) |
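The status values map naturally onto the boolean helpers that appear in the detail response (complete, inProgress, retryable). The sketch below is one possible mapping, not the server's DTO logic: treating PENDING as in progress and SKIPPED as retryable are assumptions, the latter based on the "Retry button (for failed/skipped)" note in the frontend section.

```python
from enum import Enum


class ProcessingStatus(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"

    @property
    def complete(self) -> bool:
        return self is ProcessingStatus.COMPLETED

    @property
    def in_progress(self) -> bool:
        # Assumption: queued documents count as in progress.
        return self in (ProcessingStatus.PENDING, ProcessingStatus.PROCESSING)

    @property
    def retryable(self) -> bool:
        # Assumption: both FAILED and SKIPPED documents can be retried.
        return self in (ProcessingStatus.FAILED, ProcessingStatus.SKIPPED)
```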
Docker Setup
Starting OCR Service
# Start all services including OCR
docker-compose --profile ocr up -d
# Or start only OCR service
docker-compose --profile ocr up -d ocr-service
Using Alternative Docker Images
# Use bvdcode/paddleocrapi
OCR_IMAGE=bvdcode/paddleocrapi:latest docker-compose --profile ocr up -d
# Use m986883511/paddleocr:api (GPU support)
OCR_IMAGE=m986883511/paddleocr:api docker-compose --profile ocr up -d
Docker Compose Configuration
ocr-service:
  image: ${OCR_IMAGE:-blazordevlab/paddleocrapi:latest}
  container_name: tgm-ocr-service
  environment:
    - LANG=${OCR_LANGUAGE:-ch}
    - USE_GPU=${OCR_USE_GPU:-false}
  ports:
    - "8000:8000"
  networks:
    - tgm-network
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/"]
    interval: 30s
    timeout: 10s
    start_period: 120s
    retries: 3
  profiles:
    - ocr
GPU Support
For GPU acceleration, uncomment the deploy section in docker-compose.yml:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Usage Examples
Enable OCR for a Company
# Enable OCR
curl -X POST "http://localhost:1337/api/companies/1/ocr-config/enable" \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Configure OCR settings
curl -X PUT "http://localhost:1337/api/companies/1/ocr-config" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"ocrEnabled": true,
"ocrLanguage": "en",
"ocrAutoProcess": true,
"ocrFileTypes": "pdf,png,jpg,docx"
}'
Upload Document with OCR
# Upload to company folder (OCR auto-triggered if enabled)
curl -X POST "http://localhost:1337/api/upload/company/1?folder=invoices" \
-H "Authorization: Bearer $TOKEN" \
-F "files=@invoice.pdf"
Search Document Content
# Search for documents containing "invoice"
curl "http://localhost:1337/search/documents?companyId=1&q=invoice" \
-H "Authorization: Bearer $TOKEN"
Manually Trigger OCR
# Process a specific file
curl -X POST "http://localhost:1337/api/admin/documents/123/process?companyId=1" \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Reprocess all failed documents
curl -X POST "http://localhost:1337/api/admin/documents/reprocess-failed?companyId=1" \
-H "Authorization: Bearer $ADMIN_TOKEN"
Check OCR Status
# Check service health
curl "http://localhost:1337/api/companies/ocr-status" \
-H "Authorization: Bearer $TOKEN"
# Get processing statistics
curl "http://localhost:1337/api/admin/documents/stats?companyId=1" \
-H "Authorization: Bearer $ADMIN_TOKEN"
View Document Analysis (Company Admin)
# Get analysis summary for dashboard
curl "http://localhost:1337/api/companies/1/document-analysis/summary" \
-H "Authorization: Bearer $TOKEN"
# List all analyses
curl "http://localhost:1337/api/companies/1/document-analysis" \
-H "Authorization: Bearer $TOKEN"
# List only failed analyses
curl "http://localhost:1337/api/companies/1/document-analysis?status=FAILED" \
-H "Authorization: Bearer $TOKEN"
# Get detail for specific analysis
curl "http://localhost:1337/api/companies/1/document-analysis/123?includeFullText=true" \
-H "Authorization: Bearer $TOKEN"
# Retry a failed analysis
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/123/retry" \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Retry all failed analyses
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/retry-all-failed" \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Manually trigger analysis for a file
curl -X POST "http://localhost:1337/api/companies/1/document-analysis/analyze/456" \
-H "Authorization: Bearer $ADMIN_TOKEN"
Frontend Integration
Document Analysis Sidebar Menu
The Document Analysis feature is designed to be displayed as a sidebar menu item in Organization Settings. Here's the recommended UI structure:
Sidebar Menu Entry:
Organization Settings
├── General
├── Members
├── Document Analysis <-- New entry
└── ...
Dashboard View (/org/:companyId/document-analysis):
Display the summary statistics from GET /document-analysis/summary:
- Total documents processed
- Success rate percentage
- Status breakdown (completed, failed, pending, processing)
- OCR enabled/disabled status
- Service availability indicator
List View (/org/:companyId/document-analysis/list):
Display paginated table from GET /document-analysis:
| Column | Description |
|--------|-------------|
| File Name | Link to file preview |
| Status | Badge with color (green=completed, red=failed, yellow=pending) |
| Words | Word count extracted |
| Confidence | OCR confidence percentage |
| Processed | When processing completed |
| Actions | Retry button (for failed/skipped) |
Filters:
- Status dropdown (All, Completed, Failed, Pending, Processing, Skipped)
- Date range picker
Detail View (/org/:companyId/document-analysis/:id):
Show full analysis details including:
- File preview/download link
- Processing timeline
- Extracted text (with copy button)
- OCR confidence score
- Error message (if failed)
- Retry button
Status Badge Colors
| Status | Color | Icon |
|---|---|---|
| COMPLETED | Green | ✓ |
| FAILED | Red | ✗ |
| PENDING | Yellow | ⏳ |
| PROCESSING | Blue | ⟳ |
| SKIPPED | Gray | − |
Troubleshooting
OCR Service Not Available
- Check if OCR is enabled globally: OCR_ENABLED=true
- Verify the OCR service is running: docker-compose --profile ocr ps
- Check service health: curl http://localhost:8000/
Processing Failures
- Check document status via admin API
- Review error messages in document_contents.error_message
- Check OCR service logs: docker-compose logs ocr-service
Slow Processing
- Consider enabling GPU: OCR_USE_GPU=true
- Increase timeout: OCR_TIMEOUT_SECONDS=300
- Check file size limits in OCR service configuration
Testing
The Document Analysis feature includes comprehensive test coverage:
Unit Tests
DTO Tests (DocumentAnalysisDtoTest.java) - 32 tests
- DocumentAnalysisSummaryDto: Builder pattern, getters/setters, success rate calculation
- DocumentAnalysisItemDto: Confidence percent calculation, processing duration, retryable status
- DocumentAnalysisDetailDto: File size conversion, status helpers (isComplete, isInProgress, isRetryable)
- Edge cases: Zero values, null handling, large numbers
Controller Unit Tests (DocumentAnalysisControllerTest.java) - 22 tests
- GET /summary - Returns correct summary with mocked service
- GET / - Pagination, status filtering, file info mapping
- GET /{analysisId} - Detail retrieval with/without full text
- POST /{analysisId}/retry - Authorization, status validation
- POST /retry-all-failed - Batch retry with count
- POST /analyze/{fileId} - File validation, OCR checks
Integration Tests
Controller Integration Tests (DocumentAnalysisControllerIntegrationTest.java) - 28 tests
Uses Testcontainers for real database testing:
| Test Category | Tests | Description |
|---|---|---|
| GetSummary | 3 | Summary stats, auth, 404 handling |
| ListAnalyses | 5 | Pagination, COMPLETED/FAILED/PENDING filters, file info |
| GetAnalysisDetail | 4 | Detail with/without text, company isolation |
| RetryAnalysis | 4 | Admin retry, regular user 403, status validation |
| RetryAllFailed | 3 | Batch retry, zero count, authorization |
| AnalyzeFile | 4 | OCR disabled, service unavailable, authorization |
| DataIntegrity | 2 | Company isolation, database consistency |
| EdgeCases | 3 | Empty company, pagination edge cases |
Running Tests
# Run all tests
mvn test
# Run only Document Analysis tests
mvn test -Dtest="DocumentAnalysis*"
# Run only integration tests
mvn test -Dtest="DocumentAnalysisControllerIntegrationTest"
# Run only DTO tests
mvn test -Dtest="DocumentAnalysisDtoTest"
Test Data Setup
Integration tests create:
- Test company with OCR enabled
- Regular user (Authenticated role)
- Admin user (Admin role)
- Sample files (PDF) in different states:
  - COMPLETED: With extracted text, confidence, word count
  - FAILED: With error message
  - PENDING: Awaiting processing
Production Readiness
Implemented Features
| Feature | Status | Notes |
|---|---|---|
| Authorization | ✅ | Role-based: Authenticated (view), ADMIN/MANAGER (actions) |
| Multi-tenancy | ✅ | Company isolation via company_id foreign key |
| Error Handling | ✅ | Meaningful messages for all failure scenarios |
| Async Processing | ✅ | @Async("ocrExecutor") for non-blocking OCR |
| Pagination | ✅ | Spring Data pagination with status filtering + max page size (100) |
| Rate Limiting | ✅ | Redis-based rate limiting on retry/analyze endpoints |
| Metrics & Monitoring | ✅ | Micrometer metrics for Prometheus/Actuator |
| API Documentation | ✅ | OpenAPI annotations on all endpoints |
| Logging | ✅ | Structured logging with SLF4J |
| Test Coverage | ✅ | 117+ tests (DTO + unit + integration) |
Security Considerations
| Aspect | Status | Details |
|---|---|---|
| Authentication | ✅ | JWT required for all endpoints |
| Authorization | ✅ | @PreAuthorize on all endpoints |
| Company Isolation | ✅ | Documents filtered by companyId parameter |
| Input Validation | ✅ | Status enum validated, max page size enforced |
| Rate Limiting | ✅ | Configurable per-user and per-company limits |
Rate Limiting Configuration
Rate limits are configurable via application.yml:
app:
  ocr:
    rate-limit:
      retry:
        max-requests: 10 # Max retries per user per minute
        window-seconds: 60
      analyze:
        max-requests: 20 # Max analyze requests per user per minute
        window-seconds: 60
      retry-all:
        max-requests: 3 # Max batch retries per company per 5 minutes
        window-seconds: 300
    pagination:
      max-page-size: 100 # Max items per page
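Each max-requests and window-seconds pair describes a counter that resets after the window elapses. The production limiter is Redis-based; the sketch below is an in-memory stand-in that assumes a fixed-window strategy (the actual strategy is not stated here). FixedWindowRateLimiter and the key format are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """In-memory fixed-window limiter: at most max_requests per window per key."""

    def __init__(self, max_requests: int, window_seconds: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        # key -> (window start time, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # previous window expired, start a new one
        if count >= self.max_requests:
            return False
        self._windows[key] = (start, count + 1)
        return True
```

With the retry-all settings (3 requests per 300 seconds), FixedWindowRateLimiter(3, 300).allow("company:1") would accept three calls and reject the fourth until the window resets.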
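Monitoring & Metrics
OCR processing exposes Micrometer metrics that Prometheus can scrape through Spring Boot Actuator. In Prometheus, the dotted names below appear with underscores (for example, ocr.queue.pending becomes ocr_queue_pending):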
Counters:
| Metric | Description |
|--------|-------------|
| ocr.documents.processed.total | Total documents submitted for processing |
| ocr.documents.success.total | Successful OCR extractions |
| ocr.documents.failed.total | Failed OCR extractions |
| ocr.retry.total | Manual retry requests |
| ocr.retry.all.total | Batch retry requests |
Gauges:
| Metric | Description |
|--------|-------------|
| ocr.queue.pending | Documents awaiting processing |
| ocr.queue.processing | Documents currently being processed |
| ocr.queue.failed | Documents with failed status |
Timer:
| Metric | Description |
|--------|-------------|
| ocr.processing.duration | Processing time (p50, p75, p95, p99) |
Queue Status Endpoint:
GET /api/companies/{companyId}/document-analysis/queue-status
Returns:
{
"data": {
"pending": 5,
"processing": 2,
"completed": 150,
"failed": 3,
"skipped": 1,
"total": 161
}
}
Health Endpoint:
GET /actuator/health
Returns comprehensive health status including OCR service (when enabled):
{
"status": "UP",
"components": {
"application": {
"status": "UP",
"details": {
"name": "TGM Manager Server",
"version": "1.1.0",
"features": {
"ocr": true,
"influxDb": false,
"semanticSearch": false,
"email": true,
"sms": false,
"sso": true,
"license": true,
"rabbitmq": true,
"ragflow": false,
"cron": true
}
}
},
"db": { "status": "UP" },
"minio": { "status": "UP", "details": { "bucket": "tgm-uploads" } },
"rabbit": { "status": "UP", "details": { "version": "3.12.14" } },
"redis": { "status": "UP", "details": { "version": "8.4.0" } },
"ocr": { "status": "UP", "details": { "serviceUrl": "http://localhost:8000" } },
"influxdb": { "status": "UP", "details": { "org": "ensolutions" } }
}
}
Performance
| Aspect | Implementation |
|---|---|
| Async OCR | Dedicated ocrExecutor thread pool |
| Database Indexes | Full-text search index on extracted_text |
| Lazy Loading | File info fetched separately per document |
| Pagination | Server-side with enforced max size (100) |
| Rate Limiting | Redis-based via CacheService |
Alerting Recommendations
Use the metrics to set up alerts:
# Prometheus alerting rules
groups:
  - name: ocr-alerts
    rules:
      - alert: OcrQueueBacklogHigh
        expr: ocr_queue_pending > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog is high ({{ $value }} pending)"
      - alert: OcrHighFailureRate
        expr: rate(ocr_documents_failed_total[5m]) / rate(ocr_documents_processed_total[5m]) > 0.3
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "OCR failure rate exceeds 30%"