Health monitoring
This document describes the health monitoring endpoints and service status indicators for the TGM Manager Server.
Table of Contents
- Overview
- Multi-Tenancy Context
- Health Endpoint
- Health Components
- Actuator Endpoints
- Configuration
- Alerting Integration
- Troubleshooting
Overview
TGM Manager Server provides comprehensive health monitoring through Spring Boot Actuator endpoints. The health endpoint shows the status of all external services and application features, making it easy to:
- Monitor service availability
- Debug connectivity issues
- Integrate with monitoring systems (Prometheus, Grafana, etc.)
- Configure Kubernetes/Docker health probes
Multi-Tenancy Context
Health monitoring operates at the platform level, not per-client:
┌─────────────────────────────────────────────────────────────────┐
│                         Platform Level                          │
│        /actuator/health - Checks platform-wide services         │
├─────────────────────────────────────────────────────────────────┤
│  • Master Database Connection                                   │
│  • Redis (shared cache)                                         │
│  • RabbitMQ (shared message broker)                             │
│  • MinIO/S3 (global storage)                                    │
│  • OCR Service (shared)                                         │
│  • InfluxDB (shared metrics)                                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
  ┌───────────────┐      ┌───────────────┐      ┌───────────────┐
  │   Client A    │      │   Client B    │      │   Client C    │
  │   Database    │      │   Database    │      │   Database    │
  │  (separate)   │      │  (separate)   │      │  (separate)   │
  └───────────────┘      └───────────────┘      └───────────────┘
Key Points
- Health endpoint is public: no authentication is required for /actuator/health
- Platform-wide status: shows overall system health, not individual client databases
- Client database health: managed via multi-tenant DataSource routing (not exposed in the health endpoint)
Per-Client Health
For per-client database status, use the platform admin endpoints:
# Check all client databases
curl -X GET "http://localhost:1337/api/platform/clients/migration-status" \
-H "Authorization: Bearer $ADMIN_JWT"
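The same call can be made from Java. The following is a minimal sketch using the JDK's built-in HTTP client types; the class name ClientStatusRequest and the 10-second timeout are illustrative assumptions, not part of the TGM Manager codebase. It only builds the request, so it can be inspected before anything is sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper: builds the authenticated request shown in the curl example.
final class ClientStatusRequest {

    // baseUrl is e.g. "http://localhost:1337"; adminJwt is the admin token
    // that the curl example passes as $ADMIN_JWT.
    static HttpRequest build(String baseUrl, String adminJwt) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/platform/clients/migration-status"))
                .header("Authorization", "Bearer " + adminJwt)
                .timeout(Duration.ofSeconds(10)) // assumed timeout, adjust as needed
                .GET()
                .build();
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and read the response body.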
Health Endpoint
Basic Health Check
GET /actuator/health
Access: Public (no authentication required)
Response Example
{
  "status": "UP",
  "components": {
    "application": {
      "status": "UP",
      "details": {
        "name": "TGM Manager Server",
        "version": "1.1.0",
        "features": {
          "ocr": false,
          "influxDb": false,
          "semanticSearch": false,
          "email": true,
          "sms": false,
          "sso": true,
          "license": true,
          "rabbitmq": true,
          "ragflow": false,
          "cron": true
        }
      }
    },
    "db": {
      "status": "UP",
      "components": {
        "master": {
          "status": "UP",
          "details": {
            "database": "PostgreSQL",
            "validationQuery": "isValid()"
          }
        }
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 494384795648,
        "free": 52170293248,
        "threshold": 10485760,
        "path": "/app/.",
        "exists": true
      }
    },
    "minio": {
      "status": "UP",
      "details": {
        "status": "connected",
        "endpoint": "http://minio:9000",
        "bucket": "tgm-uploads",
        "bucketExists": true
      }
    },
    "rabbit": {
      "status": "UP",
      "details": {
        "version": "3.12.14"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "8.4.0"
      }
    },
    "ocr": {
      "status": "UP",
      "details": {
        "status": "available",
        "serviceUrl": "http://ocr-service:8000"
      }
    },
    "influxdb": {
      "status": "UP",
      "details": {
        "status": "connected",
        "url": "http://influxdb:8086",
        "org": "ensolutions",
        "bucket": "tgm-metrics"
      }
    }
  }
}
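As a rough illustration of consuming this response, the sketch below extracts the top-level status with a regular expression. It assumes the top-level "status" field appears before any component status, as in the example response; the class name HealthStatusReader is hypothetical, and a production client would normally use a proper JSON parser instead.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the overall status from an /actuator/health body.
final class HealthStatusReader {

    // Matches the first "status": "<UPPERCASE_VALUE>" pair in the body.
    private static final Pattern FIRST_STATUS =
            Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    // Returns the first status value found, or empty if the body has none.
    static Optional<String> overallStatus(String body) {
        Matcher m = FIRST_STATUS.matcher(body);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The body string can come from any HTTP client, for example the JDK's `java.net.http.HttpClient` pointed at `/actuator/health`.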
Status Values
| Status | Description |
|---|---|
| UP | Service is healthy and operational |
| DOWN | Service is unavailable or unhealthy |
| OUT_OF_SERVICE | Service is intentionally offline |
| UNKNOWN | Status cannot be determined |
By default, the overall status reflects the most severe component status, so it is DOWN if any component is DOWN.
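The aggregation rule can be sketched in a few lines. This is a simplified model, not the application's code: it assumes Spring Boot's default severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), which a deployment may override, and the class name StatusAggregator is hypothetical.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical model of how component statuses roll up into one overall status.
final class StatusAggregator {

    // Most severe first; assumed to match Spring Boot's default status order.
    private static final List<String> ORDER =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    // Returns the most severe status present, or UNKNOWN when none is recognized.
    static String aggregate(Collection<String> componentStatuses) {
        for (String candidate : ORDER) {
            if (componentStatuses.contains(candidate)) {
                return candidate;
            }
        }
        return "UNKNOWN";
    }
}
```

Under this model a single DOWN component outweighs any number of UP components, which is why one failing dependency turns the whole endpoint DOWN.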
Health Components
Built-in Components (Spring Boot)
| Component | Description | When Active |
|---|---|---|
| db | PostgreSQL database connection | Always |
| redis | Redis cache/session store | Always |
| rabbit | RabbitMQ message broker | When spring.rabbitmq.enabled=true |
| diskSpace | Available disk space | Always |
| ping | Basic health check | Always |
| ssl | SSL certificate status | Always |
Custom Health Indicators
| Component | Class | Description | Conditional |
|---|---|---|---|
| application | ApplicationHealthIndicator | App version and feature flags | Always |
| minio | MinioHealthIndicator | MinIO/S3 storage | When app.storage.type=minio |
| ocr | OcrHealthIndicator | OCR service | When app.ocr.enabled=true |
| influxdb | InfluxDbHealthIndicator | InfluxDB connection | When app.influxdb.enabled=true |
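When a component seems to be missing from the response, it can help to work out which components the configuration should produce. The sketch below encodes only the conditions listed in the two component tables; the class name ExpectedComponents is hypothetical, and the property map is assumed to hold already-resolved values.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: derives the expected health components from configuration.
final class ExpectedComponents {

    // props maps property names (e.g. "app.ocr.enabled") to resolved string values.
    static Set<String> resolve(Map<String, String> props) {
        // Components documented as always active.
        Set<String> expected = new TreeSet<>(
                Set.of("application", "db", "redis", "diskSpace", "ping", "ssl"));

        // Conditional components, one per documented condition.
        if ("true".equals(props.get("spring.rabbitmq.enabled"))) expected.add("rabbit");
        if ("minio".equals(props.get("app.storage.type"))) expected.add("minio");
        if ("true".equals(props.get("app.ocr.enabled"))) expected.add("ocr");
        if ("true".equals(props.get("app.influxdb.enabled"))) expected.add("influxdb");
        return expected;
    }
}
```

Comparing this set with the keys under "components" in the actual response shows whether a component is absent by design or because of a misconfiguration.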
Application Features
The application component shows which features are enabled:
| Feature | Configuration Property | Description |
|---|---|---|
| ocr | app.ocr.enabled | PaddleOCR document analysis |
| influxDb | app.influxdb.enabled | InfluxDB time-series storage |
| semanticSearch | app.semantic-search.enabled | Semantic search with embeddings |
| email | app.email.enabled | Email notifications |
| sms | app.sms.enabled | SMS notifications |
| sso | app.sso.enabled | Single Sign-On |
| license | app.license.enabled | License validation |
| rabbitmq | spring.rabbitmq.enabled | RabbitMQ messaging |
| ragflow | app.ragflow.enabled | RAGFlow AI integration |
| cron | app.cron.enabled | Scheduled jobs |
Actuator Endpoints
Available Endpoints
| Endpoint | Access | Description |
|---|---|---|
| /actuator/health | Public | Health status of all components |
| /actuator/info | Admin | Application information |
| /actuator/metrics | Admin | Application metrics |
| /actuator/prometheus | Admin | Prometheus metrics export |
Security
- /actuator/health is publicly accessible for health probes
- All other actuator endpoints require ADMIN role authentication
// SecurityConfig.java
.requestMatchers("/actuator/health").permitAll()
.requestMatchers("/actuator/**").hasRole("ADMIN")
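The order of these two rules matters, because Spring Security evaluates request matchers in declaration order and applies the first one that matches. The sketch below is a deliberately simplified stand-in for that behavior, not Spring Security's actual matching code; the class name ActuatorAccessRules and the prefix check that approximates the /actuator/** pattern are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of first-match-wins rule evaluation for actuator paths.
final class ActuatorAccessRules {

    // Insertion order is preserved, so the health rule is checked first.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("/actuator/health", "PUBLIC"); // exact path
        RULES.put("/actuator/", "ADMIN");        // approximates /actuator/**
    }

    // Returns the access level of the first matching rule.
    static String requiredAccess(String path) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            String pattern = rule.getKey();
            boolean matches = pattern.endsWith("/")
                    ? path.startsWith(pattern)
                    : path.equals(pattern);
            if (matches) {
                return rule.getValue();
            }
        }
        return "UNSPECIFIED"; // no actuator rule applies
    }
}
```

If the two rules were declared in the opposite order, the broader actuator rule would match first and the health endpoint would also require the ADMIN role.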
Configuration
application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
      show-components: always
  health:
    db:
      enabled: true
    redis:
      enabled: true
    rabbit:
      enabled: true
    diskspace:
      enabled: true
    influxdb:
      enabled: ${ENABLE_INFLUXDB:false}
    ocr:
      enabled: ${OCR_ENABLED:false}
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OCR_ENABLED | false | Enable OCR health indicator |
| ENABLE_INFLUXDB | false | Enable InfluxDB health indicator |
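The ${NAME:default} placeholders in application.yml fall back to the default when the variable is not set. The sketch below imitates that fallback in plain Java; it is an approximation for illustration only (Spring's own boolean conversion accepts more spellings than Boolean.parseBoolean), and the class name FlagResolver is hypothetical.

```java
import java.util.Map;

// Hypothetical helper: imitates the ${NAME:default} fallback for boolean flags.
final class FlagResolver {

    // env is typically System.getenv(); name is e.g. "OCR_ENABLED".
    static boolean resolve(Map<String, String> env, String name, boolean defaultValue) {
        String raw = env.get(name);
        // Unset variable: use the default from the placeholder.
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }
}
```

For example, `FlagResolver.resolve(System.getenv(), "OCR_ENABLED", false)` yields false unless the variable is set to "true" in any letter case.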
Alerting Integration
Prometheus Alerting Rules
groups:
  - name: tgm-health-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TGM Manager is down"
      - alert: DatabaseDown
        expr: health_db_status{job="tgm-manager"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
      - alert: RedisDown
        expr: health_redis_status{job="tgm-manager"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "Redis cache is down"
      - alert: StorageDown
        expr: health_minio_status{job="tgm-manager"} == 0
        for: 2m
        labels:
          severity: high
        annotations:
          summary: "MinIO storage is down"
      - alert: LowDiskSpace
        expr: health_diskspace_free{job="tgm-manager"} < 1073741824
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (< 1GB free)"
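The LowDiskSpace threshold is expressed in bytes: 1073741824 is 2^30, or 1 GiB. The short sketch below makes the conversion explicit and applies it to the values from the example health response; the class name DiskSpaceMath is hypothetical.

```java
// Hypothetical helper: byte arithmetic behind the LowDiskSpace alert.
final class DiskSpaceMath {

    // 2^30 bytes = 1073741824, the threshold used in the alert expression.
    static final long ONE_GIB = 1L << 30;

    // True when the alert condition (free < 1 GiB) holds.
    static boolean belowThreshold(long freeBytes) {
        return freeBytes < ONE_GIB;
    }

    // Converts a byte count to GiB for human-readable reporting.
    static double toGiB(long bytes) {
        return bytes / (double) ONE_GIB;
    }
}
```

With the example response's free value of 52170293248 bytes (about 48.6 GiB) the alert condition is false. The diskSpace component's own threshold of 10485760 bytes is 10 MiB, far lower than the alert threshold, so the alert fires well before the component itself reports DOWN.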
Kubernetes Probes
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: tgm-manager
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 1337
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 3
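The probe settings translate into reaction times that can be estimated with simple arithmetic. The sketch below is an approximation that ignores probe timeouts and kubelet scheduling jitter; the class and method names are hypothetical.

```java
// Hypothetical helper: rough timing estimates for Kubernetes probes.
final class ProbeTiming {

    // Approximate seconds of continuous failure before the kubelet reacts.
    static int failureWindowSeconds(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }

    // Approximate earliest reaction after container start, assuming the first
    // probe runs right after the initial delay and every probe fails.
    static int earliestActionSeconds(int initialDelaySeconds, int periodSeconds,
                                     int failureThreshold) {
        return initialDelaySeconds + periodSeconds * (failureThreshold - 1);
    }
}
```

With the values above, the liveness probe tolerates roughly 30 seconds of failures (3 x 10 s) before a restart, and the readiness probe removes the pod from service after roughly 15 seconds (3 x 5 s).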
Docker Compose Health Check
services:
  tgm-manager:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/actuator/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
Troubleshooting
Component Shows DOWN
- Database DOWN
  - Check PostgreSQL is running
  - Verify connection string in configuration
  - Check credentials
- Redis DOWN
  - Check Redis is running on configured host/port
  - Verify password if configured
- RabbitMQ DOWN
  - Check RabbitMQ is running
  - Verify virtual host and credentials
- MinIO DOWN
  - Check MinIO is running on configured endpoint
  - Verify access key and secret key
  - Check bucket exists
- OCR DOWN
  - Check OCR service container is running
  - Verify service URL is accessible
  - Check OCR service logs
- InfluxDB DOWN
  - Check InfluxDB is running
  - Verify token and organization
  - Check bucket exists
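Several of the checks above come down to whether a service is listening on its configured host and port. A minimal sketch of such a check using only the JDK is shown below; the class name PortCheck is hypothetical, and a successful TCP connection only shows that something is listening, not that credentials or buckets are valid.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tests whether a TCP port accepts connections.
final class PortCheck {

    // Returns true if a connection to host:port succeeds within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or host not resolvable.
            return false;
        }
    }
}
```

For instance, `PortCheck.isReachable("minio", 9000, 2000)` probes the MinIO endpoint from the example response, using the hostname as seen from inside the application's network.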
Health Endpoint Returns 403
- Other actuator endpoints (not /health) require ADMIN authentication
- Use a JWT token with the ADMIN role to access protected endpoints
Missing Health Components
- Conditional health indicators only appear when their feature is enabled
- Check configuration properties to enable features