Health Check System

This document describes the health check system in GitLab Workhorse that supports readiness checks.

Overview

The health check system provides a server that handles readiness checks:

  • Readiness checks (/readiness): Determine if the service is ready to accept traffic

Configuration

The health check system is configured through the health_check_listener section in config.toml:

toml
[health_check_listener]
  network = "tcp"
  addr = "localhost:8182"
  readiness_probe_url = "http://localhost:8080/-/readiness"
  puma_control_url = "http://localhost:9293"
  check_interval = "5s"
  timeout = "1s"
  graceful_shutdown_delay = "10s"
  max_consecutive_failures = 3
  min_successful_probes = 1

Configuration Options

  • network: Network type (tcp, tcp4, tcp6, unix)
  • addr: Address to listen on (e.g., localhost:8182)
  • check_interval: How often to perform health checks (default: 2s)
  • timeout: Timeout for individual health check requests (default: 1s)
  • graceful_shutdown_delay: How long to remain unhealthy after shutdown signal (default: 10s)
  • max_consecutive_failures: Number of consecutive failures before marking readiness as unhealthy (default: 3)
  • min_successful_probes: Number of successful probes required for readiness to become healthy (default: 2)
  • readiness_probe_url: URL of Puma's readiness endpoint (default: authBackend + "/-/readiness")
  • puma_control_url: URL of Puma's control server (optional, for readiness checks only)
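
A minimal sketch of how this section could be decoded into a Go struct, assuming the BurntSushi/toml library; the type and field names are illustrative, not Workhorse's actual configuration types:

go
import (
    "log"
    "time"

    "github.com/BurntSushi/toml"
)

// duration lets TOML strings like "5s" decode into a time.Duration.
type duration struct{ time.Duration }

func (d *duration) UnmarshalText(text []byte) error {
    var err error
    d.Duration, err = time.ParseDuration(string(text))
    return err
}

// healthCheckListener mirrors the [health_check_listener] table above.
type healthCheckListener struct {
    Network                string   `toml:"network"`
    Addr                   string   `toml:"addr"`
    ReadinessProbeURL      string   `toml:"readiness_probe_url"`
    PumaControlURL         string   `toml:"puma_control_url"`
    CheckInterval          duration `toml:"check_interval"`
    Timeout                duration `toml:"timeout"`
    GracefulShutdownDelay  duration `toml:"graceful_shutdown_delay"`
    MaxConsecutiveFailures int      `toml:"max_consecutive_failures"`
    MinSuccessfulProbes    int      `toml:"min_successful_probes"`
}

func main() {
    var cfg struct {
        HealthCheckListener healthCheckListener `toml:"health_check_listener"`
    }
    if _, err := toml.DecodeFile("config.toml", &cfg); err != nil {
        log.Fatal(err)
    }
    log.Printf("listening on %s %s", cfg.HealthCheckListener.Network, cfg.HealthCheckListener.Addr)
}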

Health Check Types

Readiness Checks (/readiness)

Readiness checks determine if the service is ready to accept traffic. They check:

  • Puma readiness endpoint availability
  • Puma control server statistics (if configured)
  • Worker health and availability

Behavior:

  • Starts as unhealthy (not ready)
  • Requires min_successful_probes consecutive successes to become ready
  • Uses max_consecutive_failures threshold before marking as not ready
  • Becomes unhealthy during graceful shutdown
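
A minimal sketch of this threshold logic; names are illustrative, not Workhorse's actual implementation:

go
// readinessState applies the thresholds described above. The zero value
// starts as not ready, matching the behavior on startup.
type readinessState struct {
    ready                  bool
    consecutiveFailures    int
    consecutiveSuccesses   int
    maxConsecutiveFailures int
    minSuccessfulProbes    int
}

func (s *readinessState) record(healthy bool) {
    if healthy {
        s.consecutiveSuccesses++
        s.consecutiveFailures = 0
        if s.consecutiveSuccesses >= s.minSuccessfulProbes {
            s.ready = true // enough consecutive successes: ready
        }
    } else {
        s.consecutiveFailures++
        s.consecutiveSuccesses = 0
        if s.consecutiveFailures >= s.maxConsecutiveFailures {
            s.ready = false // too many consecutive failures: not ready
        }
    }
}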

HTTP Endpoints

The health check listener exposes HTTP endpoints that return JSON responses:

Readiness Endpoint (/readiness)

Healthy Response (200 OK)

json
{
  "checks": {
    "puma_readiness": {
      "control_duration_s": 0.00176875,
      "control_server": true,
      "control_server_last_scrape_time": "2025-10-02T21:55:43Z",
      "healthy": true,
      "readiness_duration_s": 0.028158458,
      "readiness_endpoint": true,
      "readiness_last_scrape_time": "2025-10-02T21:55:43Z"
    }
  },
  "health_thresholds": {
    "max_consecutive_failures": 3,
    "min_successful_probes": 1
  },
  "metrics": {
    "consecutive_failures": 0,
    "consecutive_successes": 1
  },
  "ready": true
}

Unhealthy Response (503 Service Unavailable)

json
{
  "checks": {
    "puma_readiness": {
      "control_duration_s": 0,
      "control_server": false,
      "control_server_last_scrape_time": "2025-10-02T21:56:13Z",
      "healthy": false,
      "readiness_duration_s": 0,
      "readiness_endpoint": false
    }
  },
  "health_thresholds": {
    "max_consecutive_failures": 3,
    "min_successful_probes": 1
  },
  "last_error": "puma control server check failed: Get \"http://localhost:9293/stats\": dial tcp [::1]:9293: connect: connection refused",
  "metrics": {
    "consecutive_failures": 3,
    "consecutive_successes": 0
  },
  "ready": false
}
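
For consumers of this endpoint, a minimal Go sketch that decodes the top-level fields of either payload; probeReadiness and the struct layout are illustrative, not part of Workhorse:

go
import (
    "encoding/json"
    "net/http"
)

// readinessResponse models the top-level fields of the payloads above.
type readinessResponse struct {
    Ready     bool   `json:"ready"`
    LastError string `json:"last_error"`
    Metrics   struct {
        ConsecutiveFailures  int `json:"consecutive_failures"`
        ConsecutiveSuccesses int `json:"consecutive_successes"`
    } `json:"metrics"`
}

func probeReadiness(url string) (readinessResponse, error) {
    var r readinessResponse
    resp, err := http.Get(url)
    if err != nil {
        return r, err
    }
    defer resp.Body.Close()
    // Both the 200 and 503 responses carry the same JSON shape.
    err = json.NewDecoder(resp.Body).Decode(&r)
    return r, err
}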

Prometheus Metrics

The health check system exposes several Prometheus metrics:

Overall Status Metrics

  • workhorse_readiness_status: Overall readiness status (1 = ready, 0 = not ready)

Error Metrics

  • workhorse_readiness_errors_total: Total number of readiness check errors

Performance Metrics

  • workhorse_health_check_duration_seconds: Duration of health checks

Individual Check Metrics

  • workhorse_readiness_puma_readiness_check: Status of Puma readiness endpoint
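
A minimal sketch of registering such metrics with the Prometheus Go client; the metric types (gauge, counter, histogram) are inferred from the names, and the registration code is an assumption, not Workhorse's actual code:

go
import "github.com/prometheus/client_golang/prometheus"

var (
    readinessStatus = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "workhorse_readiness_status",
        Help: "Overall readiness status (1 = ready, 0 = not ready).",
    })
    readinessErrors = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "workhorse_readiness_errors_total",
        Help: "Total number of readiness check errors.",
    })
    checkDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "workhorse_health_check_duration_seconds",
        Help: "Duration of health checks.",
    })
)

func init() {
    prometheus.MustRegister(readinessStatus, readinessErrors, checkDuration)
}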

Graceful Shutdown

When Workhorse receives a SIGTERM signal:

  1. Readiness: Immediately becomes unhealthy (503) to stop receiving new traffic
  2. Delay: Waits for graceful_shutdown_delay to allow load balancers to drain traffic
  3. Shutdown: Proceeds with normal shutdown after the delay

This approach eliminates 502 errors during deployments and provides smooth traffic transitions.
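
A minimal sketch of this sequence, assuming a hypothetical markNotReady helper that flips /readiness to 503:

go
import (
    "os"
    "os/signal"
    "syscall"
    "time"
)

// handleShutdown implements the three steps above. markNotReady is a
// hypothetical stand-in for flipping the readiness state to unhealthy.
func handleShutdown(markNotReady func(), gracefulShutdownDelay time.Duration) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM)
    <-sigCh

    markNotReady()                    // 1. /readiness now returns 503
    time.Sleep(gracefulShutdownDelay) // 2. let load balancers drain traffic
    // 3. proceed with normal shutdown here
}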

Extending the System

Adding New Checkers

Implement the HealthChecker interface:
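
As inferred from the example that follows, the interface and its CheckResult type look roughly like this; the authoritative definitions live in the healthcheck package:

go
// Sketch of the contract, inferred from the example below.
type HealthChecker interface {
    Name() string
    Check(ctx context.Context) CheckResult
}

type CheckResult struct {
    Name    string
    Healthy bool
    Details map[string]interface{}
}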

go
import "context"

// MyChecker is an example checker; it assumes the surrounding healthcheck
// package, which provides the CheckResult type.
type MyChecker struct {
    name string
}

// Name is the key under which this check appears in the JSON response.
func (c *MyChecker) Name() string {
    return c.name
}

// Check runs the probe and reports the result.
func (c *MyChecker) Check(ctx context.Context) CheckResult {
    // Perform your health check logic here.
    return CheckResult{
        Name:    c.name,
        Healthy: true,
        Details: map[string]interface{}{
            "your-field": "ok",
        },
    }
}

Registering Custom Checkers

go
// Add to readiness checks
server.AddReadinessChecker(NewMyChecker("my_readiness_check"))

Troubleshooting

Common Issues

  1. Health check endpoints not responding

    • Verify health_check_listener is configured in config.toml
    • Check if the listen address is available
    • Review Workhorse startup logs for errors
  2. Readiness always unhealthy

    • Check that the readiness_probe_url endpoint is accessible
    • Verify Puma is running and healthy
    • Review consecutive failure/success thresholds
  3. Flapping health status

    • Adjust max_consecutive_failures and min_successful_probes
    • Increase timeout values if checks are timing out
    • Review check_interval frequency

Debug Logging

Enable debug logging to see detailed health check information:

bash
gitlab-workhorse -logLevel debug

Manual Testing

Test health check endpoints manually:

bash
# Test readiness
curl -v http://localhost:8182/readiness

Architecture

Components

  1. Server: Main server that manages readiness checks
  2. HealthChecker Interface: Contract for individual health checkers
  3. PumaReadinessChecker: Checks Puma's readiness endpoint and control server

Flow

  1. Initialization: Server starts with configured checkers
  2. Periodic Checks: Runs health checks at configured intervals
  3. State Management: Updates readiness status based on check results
  4. HTTP Serving: Responds to health check requests with current status
  5. Metrics Export: Updates Prometheus metrics continuously
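
A minimal sketch of steps 2-4, assuming hypothetical check and record callbacks standing in for the server's internals:

go
import (
    "context"
    "time"
)

// runLoop performs periodic checks and feeds the results into the
// readiness state until the context is cancelled.
func runLoop(ctx context.Context, interval time.Duration, check func(context.Context) bool, record func(bool)) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return // shutting down: stop checking
        case <-ticker.C:
            record(check(ctx)) // run all checkers, then update readiness
        }
    }
}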

Migration from Legacy System

Benefits of the Unified System

  • Better Logic: Improved consecutive failure/success handling
  • Enhanced Metrics: More detailed Prometheus metrics
  • Future-Proof: Active development and new features

See Also