Health monitoring

This document describes the health monitoring endpoints and service status indicators for the TGM Manager Server.

Table of Contents

Overview
Multi-Tenancy Context
Health Endpoint
Health Components
Actuator Endpoints
Configuration
Alerting Integration
Troubleshooting

Overview

TGM Manager Server exposes health monitoring through Spring Boot Actuator endpoints. The health endpoint reports the status of each platform-level external service and of the enabled application features, which makes it possible to:

Monitor service availability
Debug connectivity issues
Integrate with monitoring systems (Prometheus, Grafana, etc.)
Configure Kubernetes/Docker health probes

Multi-Tenancy Context

Health monitoring operates at the platform level, not per-client:

┌─────────────────────────────────────────────────────────────────┐
│                    Platform Level                                │
│  /actuator/health - Checks platform-wide services               │
├─────────────────────────────────────────────────────────────────┤
│  • Master Database Connection                                    │
│  • Redis (shared cache)                                         │
│  • RabbitMQ (shared message broker)                             │
│  • MinIO/S3 (global storage)                                    │
│  • OCR Service (shared)                                         │
│  • InfluxDB (shared metrics)                                    │
└─────────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Client A    │      │   Client B    │      │   Client C    │
│   Database    │      │   Database    │      │   Database    │
│  (separate)   │      │  (separate)   │      │  (separate)   │
└───────────────┘      └───────────────┘      └───────────────┘

Key Points

Health endpoint is public - No authentication required for /actuator/health
Platform-wide status - Shows overall system health, not individual client databases
Client database health - Managed via multi-tenant DataSource routing (not exposed in health endpoint)

Per-Client Health

For per-client database status, use the platform admin endpoints:

# Check all client databases
curl -X GET "http://localhost:1337/api/platform/clients/migration-status" \
  -H "Authorization: Bearer $ADMIN_JWT"

Health Endpoint

Basic Health Check

GET /actuator/health

Access: Public (no authentication required)

Response Example

{
  "status": "UP",
  "components": {
    "application": {
      "status": "UP",
      "details": {
        "name": "TGM Manager Server",
        "version": "1.1.0",
        "features": {
          "ocr": false,
          "influxDb": false,
          "semanticSearch": false,
          "email": true,
          "sms": false,
          "sso": true,
          "license": true,
          "rabbitmq": true,
          "ragflow": false,
          "cron": true
        }
      }
    },
    "db": {
      "status": "UP",
      "components": {
        "master": {
          "status": "UP",
          "details": {
            "database": "PostgreSQL",
            "validationQuery": "isValid()"
          }
        }
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 494384795648,
        "free": 52170293248,
        "threshold": 10485760,
        "path": "/app/.",
        "exists": true
      }
    },
    "minio": {
      "status": "UP",
      "details": {
        "status": "connected",
        "endpoint": "http://minio:9000",
        "bucket": "tgm-uploads",
        "bucketExists": true
      }
    },
    "rabbit": {
      "status": "UP",
      "details": {
        "version": "3.12.14"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "8.4.0"
      }
    },
    "ocr": {
      "status": "UP",
      "details": {
        "status": "available",
        "serviceUrl": "http://ocr-service:8000"
      }
    },
    "influxdb": {
      "status": "UP",
      "details": {
        "status": "connected",
        "url": "http://influxdb:8086",
        "org": "ensolutions",
        "bucket": "tgm-metrics"
      }
    }
  }
}
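A monitoring script does not have to parse this JSON to decide whether the server is healthy. As a minimal sketch, assuming Spring Boot's default status mapping (HTTP 200 for UP, HTTP 503 for DOWN and OUT_OF_SERVICE), the following uses the JDK's built-in `java.net.http` client and looks only at the response code. The class name `HealthProbe` and the 5-second timeout are illustrative assumptions, not part of TGM Manager Server.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: probes /actuator/health and reports healthy on HTTP 200. */
final class HealthProbe {

    /** Builds a GET request for the public health endpoint under the given base URL. */
    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/health"))
                .timeout(Duration.ofSeconds(5)) // assumed timeout, tune as needed
                .GET()
                .build();
    }

    /** Returns true only when the endpoint answers with HTTP 200. */
    static boolean isHealthy(String baseUrl) {
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, etc.: treat as unhealthy.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With the port used elsewhere in this document, a call would look like `HealthProbe.isHealthy("http://localhost:1337")`.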

Status Values

Status	Description
`UP`	Service is healthy and operational
`DOWN`	Service is unavailable or unhealthy
`OUT_OF_SERVICE`	Service is intentionally offline
`UNKNOWN`	Status cannot be determined

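The overall status is the most severe status among the components. With Spring Boot's default ordering (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), the overall status is DOWN if any component is DOWN, and the endpoint then responds with HTTP 503 instead of 200.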
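The aggregation rule can be sketched in a few lines of plain Java. This is an illustration of the behaviour under Spring Boot's default severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), not the framework's actual implementation; the class name `HealthStatusSketch` is hypothetical.

```java
import java.util.List;

/** Illustrative sketch of severity-ordered status aggregation. */
final class HealthStatusSketch {

    // Most severe first, following Spring Boot's default order.
    private static final List<String> SEVERITY_ORDER =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe known status, or UNKNOWN when none is present. */
    static String aggregate(List<String> componentStatuses) {
        for (String candidate : SEVERITY_ORDER) {
            if (componentStatuses.contains(candidate)) {
                return candidate;
            }
        }
        return "UNKNOWN";
    }
}
```

For example, a single DOWN component among several UP components yields an overall DOWN, while a mix of UP and UNKNOWN yields UP.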

Health Components

Built-in Components (Spring Boot)

Component	Description	When Active
`db`	PostgreSQL database connection	Always
`redis`	Redis cache/session store	Always
`rabbit`	RabbitMQ message broker	When `spring.rabbitmq.enabled=true`
`diskSpace`	Available disk space	Always
`ping`	Basic health check	Always
`ssl`	SSL certificate status	Always

Custom Health Indicators

Component	Class	Description	Conditional
`application`	`ApplicationHealthIndicator`	App version and feature flags	Always
`minio`	`MinioHealthIndicator`	MinIO/S3 storage	When `app.storage.type=minio`
`ocr`	`OcrHealthIndicator`	OCR service	When `app.ocr.enabled=true`
`influxdb`	`InfluxDbHealthIndicator`	InfluxDB connection	When `app.influxdb.enabled=true`

Application Features

The application component shows which features are enabled:

Feature	Configuration Property	Description
`ocr`	`app.ocr.enabled`	PaddleOCR document analysis
`influxDb`	`app.influxdb.enabled`	InfluxDB time-series storage
`semanticSearch`	`app.semantic-search.enabled`	Semantic search with embeddings
`email`	`app.email.enabled`	Email notifications
`sms`	`app.sms.enabled`	SMS notifications
`sso`	`app.sso.enabled`	Single Sign-On
`license`	`app.license.enabled`	License validation
`rabbitmq`	`spring.rabbitmq.enabled`	RabbitMQ messaging
`ragflow`	`app.ragflow.enabled`	RAGFlow AI integration
`cron`	`app.cron.enabled`	Scheduled jobs

Actuator Endpoints

Available Endpoints

Endpoint	Access	Description
`/actuator/health`	Public	Health status of all components
`/actuator/info`	Admin	Application information
`/actuator/metrics`	Admin	Application metrics
`/actuator/prometheus`	Admin	Prometheus metrics export

Security

/actuator/health is publicly accessible for health probes
All other actuator endpoints require ADMIN role authentication

// SecurityConfig.java (excerpt)
// Rules are evaluated in order, so the public health rule must precede the catch-all.
.requestMatchers("/actuator/health").permitAll()
.requestMatchers("/actuator/**").hasRole("ADMIN")

Configuration

application.yml

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
      show-components: always
  health:
    db:
      enabled: true
    redis:
      enabled: true
    rabbit:
      enabled: true
    diskspace:
      enabled: true
    influxdb:
      enabled: ${ENABLE_INFLUXDB:false}
    ocr:
      enabled: ${OCR_ENABLED:false}

Environment Variables

Variable	Default	Description
`OCR_ENABLED`	`false`	Enable OCR health indicator
`ENABLE_INFLUXDB`	`false`	Enable InfluxDB health indicator

Alerting Integration

Prometheus Alerting Rules

groups:
  - name: tgm-health-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TGM Manager is down"

      - alert: DatabaseDown
        expr: health_db_status{job="tgm-manager"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"

      - alert: RedisDown
        expr: health_redis_status{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "Redis cache is down"

      - alert: StorageDown
        expr: health_minio_status{job="tgm-manager"} == 0
        for: 2m
        labels:
          severity: high
        annotations:
          summary: "MinIO storage is down"

      - alert: LowDiskSpace
        expr: health_diskspace_free{job="tgm-manager"} < 1073741824
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (< 1 GiB free)"
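The byte literals in this document are easier to check when written as powers of 1024. The sketch below is an illustration only (the class name `DiskSpaceThresholds` is hypothetical): it shows that the LowDiskSpace threshold 1073741824 is exactly 1 GiB, that the `diskSpace` threshold 10485760 from the response example is exactly 10 MiB, and how the alert condition would evaluate for a given free-space value.

```java
/** Illustrative constants and check for the disk-space values used in this document. */
final class DiskSpaceThresholds {

    /** 1 GiB in bytes: the literal in the LowDiskSpace alert expression. */
    static final long ONE_GIB = 1024L * 1024L * 1024L;

    /** 10 MiB in bytes: the diskSpace threshold shown in the response example. */
    static final long TEN_MIB = 10L * 1024L * 1024L;

    /** Mirrors the alert condition: free bytes strictly below 1 GiB. */
    static boolean isLowDisk(long freeBytes) {
        return freeBytes < ONE_GIB;
    }
}
```

The `free` value in the response example (52170293248 bytes) is well above the threshold, so the alert would not fire for that snapshot.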

Kubernetes Probes

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: tgm-manager
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 3
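A rough way to reason about these settings: a probe trips after `failureThreshold` consecutive failures spaced `periodSeconds` apart. The sketch below computes that window; it is an approximation that ignores per-probe timeouts and scheduling jitter, and the class name `ProbeTiming` is hypothetical.

```java
/** Illustrative arithmetic for Kubernetes probe settings. */
final class ProbeTiming {

    /** Approximate seconds of continuous failure before the probe trips. */
    static int secondsToTrip(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }
}
```

With the values above, the liveness probe trips after roughly 30 seconds of failures (container restart) and the readiness probe after roughly 15 seconds (pod removed from Service endpoints).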

Docker Compose Health Check

services:
  tgm-manager:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/actuator/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

Troubleshooting

Component Shows DOWN

Database DOWN
  Check PostgreSQL is running
  Verify connection string in configuration
  Check credentials
Redis DOWN
  Check Redis is running on configured host/port
  Verify password if configured
RabbitMQ DOWN
  Check RabbitMQ is running
  Verify virtual host and credentials
MinIO DOWN
  Check MinIO is running on configured endpoint
  Verify access key and secret key
  Check bucket exists
OCR DOWN
  Check OCR service container is running
  Verify service URL is accessible
  Check OCR service logs
InfluxDB DOWN
  Check InfluxDB is running
  Verify token and organization
  Check bucket exists

Actuator Endpoint Returns 403

/actuator/health itself is public, so a 403 usually means a different actuator endpoint was requested
All other actuator endpoints require ADMIN authentication
Use a JWT with the ADMIN role to access protected endpoints
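As a sketch of an authenticated call from Java, the following builds a request with the same `Authorization: Bearer` header used in the curl example earlier in this document. It relies only on the JDK's `java.net.http` API; the class name `AdminActuatorRequest` and its parameters are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper: builds a GET request for a protected actuator endpoint. */
final class AdminActuatorRequest {

    /**
     * @param baseUrl      server base URL, e.g. http://localhost:1337
     * @param endpointPath actuator path, e.g. /actuator/metrics
     * @param adminJwt     a JWT that carries the ADMIN role
     */
    static HttpRequest build(String baseUrl, String endpointPath, String adminJwt) {
        return HttpRequest.newBuilder(URI.create(baseUrl + endpointPath))
                .header("Authorization", "Bearer " + adminJwt)
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; a 403 response at that point suggests the token lacks the ADMIN role.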

Missing Health Components

Conditional health indicators only appear when their feature is enabled
Check configuration properties to enable features
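When a component is missing, it can help to compare the response against what the configuration should produce. The sketch below encodes the conditions from the component tables in this document as a plain lookup; it is an illustration only (the class `ExpectedHealthComponents` is hypothetical) and does not account for the `management.health.*.enabled` switches in application.yml.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative lookup: which health components the tables above say should appear. */
final class ExpectedHealthComponents {

    static List<String> forConfig(Map<String, String> props) {
        // Components documented as "Always" active.
        List<String> components = new ArrayList<>(
                List.of("application", "db", "redis", "diskSpace", "ping", "ssl"));

        // Conditional components, keyed on the documented properties.
        if ("true".equals(props.get("spring.rabbitmq.enabled"))) {
            components.add("rabbit");
        }
        if ("minio".equals(props.get("app.storage.type"))) {
            components.add("minio");
        }
        if ("true".equals(props.get("app.ocr.enabled"))) {
            components.add("ocr");
        }
        if ("true".equals(props.get("app.influxdb.enabled"))) {
            components.add("influxdb");
        }
        return components;
    }
}
```

For instance, with `app.ocr.enabled=false` the `ocr` component is not expected in the response, so its absence is not a fault.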