This document describes the health monitoring endpoints and service status indicators for the TGM Manager Server.

Overview

TGM Manager Server provides comprehensive health monitoring through Spring Boot Actuator endpoints. The health endpoint shows the status of all external services and application features, making it easy to:

  • Monitor service availability
  • Debug connectivity issues
  • Integrate with monitoring systems (Prometheus, Grafana, etc.)
  • Configure Kubernetes/Docker health probes

Multi-Tenancy Context

Health monitoring operates at the platform level, not per-client:

┌─────────────────────────────────────────────────────────────────┐
│                    Platform Level                                │
│  /actuator/health - Checks platform-wide services               │
├─────────────────────────────────────────────────────────────────┤
│  • Master Database Connection                                    │
│  • Redis (shared cache)                                         │
│  • RabbitMQ (shared message broker)                             │
│  • MinIO/S3 (global storage)                                    │
│  • OCR Service (shared)                                         │
│  • InfluxDB (shared metrics)                                    │
└─────────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Client A    │      │   Client B    │      │   Client C    │
│   Database    │      │   Database    │      │   Database    │
│  (separate)   │      │  (separate)   │      │  (separate)   │
└───────────────┘      └───────────────┘      └───────────────┘

Key Points

  1. Health endpoint is public - No authentication required for /actuator/health
  2. Platform-wide status - Shows overall system health, not individual client databases
  3. Client database health - Managed via multi-tenant DataSource routing (not exposed in health endpoint)

Per-Client Health

For per-client database status, use the platform admin endpoints:

# Check all client databases
curl -X GET "http://localhost:1337/api/platform/clients/migration-status" \
  -H "Authorization: Bearer $ADMIN_JWT"
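
For callers that prefer Java over curl, the same request can be sketched with the JDK's built-in HTTP client. The class and method names here are illustrative, not part of the server code base, and the token is assumed to come from the same ADMIN_JWT environment variable used in the curl example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MigrationStatusRequest {

    // Builds the same GET request as the curl command, without sending it.
    public static HttpRequest build(String baseUrl, String adminJwt) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/platform/clients/migration-status"))
                .header("Authorization", "Bearer " + adminJwt)
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (requires a running server).
    public static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // "example-token" is a placeholder used only when ADMIN_JWT is not set.
        String token = System.getenv().getOrDefault("ADMIN_JWT", "example-token");
        HttpRequest request = build("http://localhost:1337", token);
        // Prints the request line only; call fetch(request) to actually send it.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Building the request separately from sending it keeps the URL and header logic easy to check without a live server.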

Health Endpoint

Basic Health Check

GET /actuator/health

Access: Public (no authentication required)

Response Example

{
  "status": "UP",
  "components": {
    "application": {
      "status": "UP",
      "details": {
        "name": "TGM Manager Server",
        "version": "1.1.0",
        "features": {
          "ocr": false,
          "influxDb": false,
          "semanticSearch": false,
          "email": true,
          "sms": false,
          "sso": true,
          "license": true,
          "rabbitmq": true,
          "ragflow": false,
          "cron": true
        }
      }
    },
    "db": {
      "status": "UP",
      "components": {
        "master": {
          "status": "UP",
          "details": {
            "database": "PostgreSQL",
            "validationQuery": "isValid()"
          }
        }
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 494384795648,
        "free": 52170293248,
        "threshold": 10485760,
        "path": "/app/.",
        "exists": true
      }
    },
    "minio": {
      "status": "UP",
      "details": {
        "status": "connected",
        "endpoint": "http://minio:9000",
        "bucket": "tgm-uploads",
        "bucketExists": true
      }
    },
    "rabbit": {
      "status": "UP",
      "details": {
        "version": "3.12.14"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "8.4.0"
      }
    },
    "ocr": {
      "status": "UP",
      "details": {
        "status": "available",
        "serviceUrl": "http://ocr-service:8000"
      }
    },
    "influxdb": {
      "status": "UP",
      "details": {
        "status": "connected",
        "url": "http://influxdb:8086",
        "org": "ensolutions",
        "bucket": "tgm-metrics"
      }
    }
  }
}

Status Values

| Status         | Description                         |
|----------------|-------------------------------------|
| UP             | Service is healthy and operational  |
| DOWN           | Service is unavailable or unhealthy |
| OUT_OF_SERVICE | Service is intentionally offline    |
| UNKNOWN        | Status cannot be determined         |

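The overall status is the most severe status reported by any component. With Spring Boot's default severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), a single DOWN component makes the overall status DOWN.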
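
The aggregation rule can be sketched in plain Java. This is a simplified model for illustration, not the Actuator implementation: statuses are plain strings, the severity order is Spring Boot's default (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), and the HTTP mapping assumes the default behavior where DOWN and OUT_OF_SERVICE return 503 and everything else returns 200.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

public class StatusAggregation {

    // Severity order, most severe first (Spring Boot's default ordering).
    static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    // Returns the most severe known status, or UNKNOWN when there is none.
    public static String aggregate(Collection<String> componentStatuses) {
        return componentStatuses.stream()
                .filter(ORDER::contains)
                .min(Comparator.comparingInt((String s) -> ORDER.indexOf(s)))
                .orElse("UNKNOWN");
    }

    // HTTP status code returned for an overall status under the default mapping.
    public static int httpStatus(String overall) {
        return (overall.equals("DOWN") || overall.equals("OUT_OF_SERVICE")) ? 503 : 200;
    }

    public static void main(String[] args) {
        String overall = aggregate(List.of("UP", "UP", "DOWN", "UP"));
        System.out.println(overall + " -> HTTP " + httpStatus(overall));
    }
}
```

Because the HTTP status code follows the overall status, probes and load balancers can rely on the response code alone without parsing the JSON body.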


Health Components

Built-in Components (Spring Boot)

| Component | Description                    | When Active                      |
|-----------|--------------------------------|----------------------------------|
| db        | PostgreSQL database connection | Always                           |
| redis     | Redis cache/session store      | Always                           |
| rabbit    | RabbitMQ message broker        | When spring.rabbitmq.enabled=true |
| diskSpace | Available disk space           | Always                           |
| ping      | Basic health check             | Always                           |
| ssl       | SSL certificate status         | Always                           |

Custom Health Indicators

| Component   | Class                      | Description                   | Conditional                    |
|-------------|----------------------------|-------------------------------|--------------------------------|
| application | ApplicationHealthIndicator | App version and feature flags | Always                         |
| minio       | MinioHealthIndicator       | MinIO/S3 storage              | When app.storage.type=minio    |
| ocr         | OcrHealthIndicator         | OCR service                   | When app.ocr.enabled=true      |
| influxdb    | InfluxDbHealthIndicator    | InfluxDB connection           | When app.influxdb.enabled=true |

Application Features

The application component shows which features are enabled:

| Feature        | Configuration Property      | Description                     |
|----------------|-----------------------------|---------------------------------|
| ocr            | app.ocr.enabled             | PaddleOCR document analysis     |
| influxDb       | app.influxdb.enabled        | InfluxDB time-series storage    |
| semanticSearch | app.semantic-search.enabled | Semantic search with embeddings |
| email          | app.email.enabled           | Email notifications             |
| sms            | app.sms.enabled             | SMS notifications               |
| sso            | app.sso.enabled             | Single Sign-On                  |
| license        | app.license.enabled         | License validation              |
| rabbitmq       | spring.rabbitmq.enabled     | RabbitMQ messaging              |
| ragflow        | app.ragflow.enabled         | RAGFlow AI integration          |
| cron           | app.cron.enabled            | Scheduled jobs                  |

Actuator Endpoints

Available Endpoints

| Endpoint             | Access | Description                     |
|----------------------|--------|---------------------------------|
| /actuator/health     | Public | Health status of all components |
| /actuator/info       | Admin  | Application information         |
| /actuator/metrics    | Admin  | Application metrics             |
| /actuator/prometheus | Admin  | Prometheus metrics export       |

Security

  • /actuator/health is publicly accessible for health probes
  • All other actuator endpoints require authentication with the ADMIN role

// SecurityConfig.java
// Rules are evaluated in order, so the public health rule must precede the catch-all.
.requestMatchers("/actuator/health").permitAll()
.requestMatchers("/actuator/**").hasRole("ADMIN")

Configuration

application.yml

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
      show-components: always
  health:
    db:
      enabled: true
    redis:
      enabled: true
    rabbit:
      enabled: true
    diskspace:
      enabled: true
    influxdb:
      enabled: ${ENABLE_INFLUXDB:false}
    ocr:
      enabled: ${OCR_ENABLED:false}

Environment Variables

| Variable        | Default | Description                      |
|-----------------|---------|----------------------------------|
| OCR_ENABLED     | false   | Enable OCR health indicator      |
| ENABLE_INFLUXDB | false   | Enable InfluxDB health indicator |

Alerting Integration

Prometheus Alerting Rules

groups:
  - name: tgm-health-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TGM Manager is down"

      - alert: DatabaseDown
        expr: health_db_status{job="tgm-manager"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"

      - alert: RedisDown
        expr: health_redis_status{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "Redis cache is down"

      - alert: StorageDown
        expr: health_minio_status{job="tgm-manager"} == 0
        for: 2m
        labels:
          severity: high
        annotations:
          summary: "MinIO storage is down"

      - alert: LowDiskSpace
        expr: health_diskspace_free{job="tgm-manager"} < 1073741824
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (< 1 GiB free)"
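
The alert expressions above compare metrics such as health_db_status against 0, which implies one numeric gauge per component. Spring Boot Actuator does not export health statuses as metrics by default, so these metric names presumably come from gauges the application registers itself. The sketch below shows only the value mapping such a gauge would need; the class name and the UP=1 / anything-else=0 convention are assumptions, not confirmed project code.

```java
public class HealthGaugeMapping {

    // Threshold used by the LowDiskSpace rule: 1024^3 bytes = 1073741824 (1 GiB).
    public static final long ONE_GIB_BYTES = 1024L * 1024L * 1024L;

    // Assumed convention: 1 means healthy, 0 means any non-UP status.
    public static double toGaugeValue(String status) {
        return "UP".equals(status) ? 1.0 : 0.0;
    }

    // Mirrors the LowDiskSpace expression: free bytes below 1 GiB.
    public static boolean lowDiskSpace(long freeBytes) {
        return freeBytes < ONE_GIB_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("UP   -> " + toGaugeValue("UP"));
        System.out.println("DOWN -> " + toGaugeValue("DOWN"));
        // Free space taken from the diskSpace component in the response example.
        System.out.println("low disk: " + lowDiskSpace(52170293248L));
    }
}
```

With this convention, an expression like health_db_status == 0 fires for DOWN, OUT_OF_SERVICE and UNKNOWN alike; a finer-grained encoding would need different alert thresholds.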

Kubernetes Probes

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: tgm-manager
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 3
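
As a rough guide to what these settings mean in practice, the time a container must fail continuously before Kubernetes reacts is about periodSeconds × failureThreshold. The helper below is only an estimate under that assumption; it ignores probe timeouts and kubelet scheduling jitter, and the class name is illustrative.

```java
public class ProbeTiming {

    // Approximate seconds of continuous failure before the kubelet acts on a probe.
    public static int detectionWindowSeconds(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }

    public static void main(String[] args) {
        // Values taken from the livenessProbe and readinessProbe settings above.
        System.out.println("liveness:  ~" + detectionWindowSeconds(10, 3) + "s");
        System.out.println("readiness: ~" + detectionWindowSeconds(5, 3) + "s");
    }
}
```

With the values shown, an unhealthy pod is removed from service after roughly 15 seconds and restarted after roughly 30 seconds.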

Docker Compose Health Check

services:
  tgm-manager:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/actuator/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

Troubleshooting

Component Shows DOWN

  1. Database DOWN
     • Check PostgreSQL is running
     • Verify connection string in configuration
     • Check credentials

  2. Redis DOWN
     • Check Redis is running on configured host/port
     • Verify password if configured

  3. RabbitMQ DOWN
     • Check RabbitMQ is running
     • Verify virtual host and credentials

  4. MinIO DOWN
     • Check MinIO is running on configured endpoint
     • Verify access key and secret key
     • Check bucket exists

  5. OCR DOWN
     • Check OCR service container is running
     • Verify service URL is accessible
     • Check OCR service logs

  6. InfluxDB DOWN
     • Check InfluxDB is running
     • Verify token and organization
     • Check bucket exists

Health Endpoint Returns 403

  • /actuator/health itself is public, so a 403 usually means the request went to another actuator endpoint, all of which require the ADMIN role
  • Send a JWT that carries the ADMIN role to access the protected endpoints

Missing Health Components

  • Conditional health indicators only appear when their feature is enabled
  • Check configuration properties to enable features
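
To work out which components should appear for a given configuration, the conditions from the component tables can be written down as a small check. This is an illustrative helper, not server code: the class name is hypothetical, and the properties are passed in as a plain map rather than read from Spring's environment.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExpectedComponents {

    // Components the health endpoint should list for the given properties.
    public static Set<String> expected(Map<String, String> props) {
        // Components documented as always active.
        Set<String> names = new TreeSet<>(
                Set.of("application", "db", "redis", "diskSpace", "ping", "ssl"));
        // Conditional components, using the conditions from the component tables.
        if ("true".equals(props.get("spring.rabbitmq.enabled"))) names.add("rabbit");
        if ("minio".equals(props.get("app.storage.type"))) names.add("minio");
        if ("true".equals(props.get("app.ocr.enabled"))) names.add("ocr");
        if ("true".equals(props.get("app.influxdb.enabled"))) names.add("influxdb");
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("app.storage.type", "minio", "app.ocr.enabled", "true");
        // Component names as they might be read from an /actuator/health response.
        Set<String> actual = Set.of("application", "db", "redis", "diskSpace", "ping", "ssl", "minio");
        Set<String> missing = new TreeSet<>(expected(props));
        missing.removeAll(actual);
        System.out.println("missing: " + missing);
    }
}
```

A component that is expected but absent points at a configuration property that is not set the way the deployment assumes.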