Meshery Health Check Configuration

This document provides guidance on configuring Kubernetes health checks for Meshery deployments using the enhanced /healthz/live and /healthz/ready endpoints.

Overview

Meshery implements Kubernetes-compliant health check endpoints that follow best practices from the Kubernetes API server:

/healthz/live - Liveness probe endpoint
/healthz/ready - Readiness probe endpoint

Both endpoints support a ?verbose=1 query parameter for detailed health check information.

Health Check Behavior

Liveness Probe (`/healthz/live`)

The liveness probe checks if Meshery server is running and responsive. Returns:

200 OK with "ok" when the server is alive
503 Service Unavailable when provider capabilities are not loaded

Readiness Probe (`/healthz/ready`)

The readiness probe checks if Meshery is ready to accept traffic. It performs:

Capability Check (affects readiness): Verifies provider capabilities are loaded
Extension Check (informational only): Reports extension package status

Returns:

200 OK when capabilities are loaded (ready to serve traffic)
503 Service Unavailable when capabilities are not loaded

Note: Extension status is informational and does not affect the readiness state.

Default Configuration

The Meshery Helm chart includes pre-configured health checks in values.yaml:

yaml

probe:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 80
    periodSeconds: 12
    failureThreshold: 4

  readinessProbe:
    enabled: true
    initialDelaySeconds: 10
    periodSeconds: 4
    failureThreshold: 4

Installation Considerations

Initial Installation

When installing Meshery for the first time:

Liveness Probe: Set initialDelaySeconds to allow time for the server to start and load configurations
- Recommended: 80-120 seconds for first startup
- This accounts for image pull, provider initialization, and capability loading
Readiness Probe: Can have a shorter delay as it will automatically retry
- Recommended: 10-30 seconds
- Lower periodSeconds (e.g., 4-5) for faster readiness detection

Example for new installations:

yaml

probe:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 120  # Allow more time for initial setup
    periodSeconds: 15
    failureThreshold: 5

  readinessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 5
    failureThreshold: 4

Upgrade Considerations

When upgrading an existing Meshery deployment:

Rolling Update Strategy: The default deployment uses rolling updates which rely on health checks
Capability Loading: Provider capabilities must reload during upgrade
Extension Availability: Extension packages may need to be re-downloaded

Recommended settings for upgrades:

yaml

probe:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60   # Shorter than initial install
    periodSeconds: 12
    failureThreshold: 4

  readinessProbe:
    enabled: true
    initialDelaySeconds: 15   # Allow time for capability reload
    periodSeconds: 5
    failureThreshold: 6       # Higher threshold during upgrades

Advanced Configuration

Verbose Mode for Debugging

During troubleshooting, you can manually check health status with verbose output:

bash

# Check liveness with details
kubectl exec -n meshery deployment/meshery -- curl -s "http://localhost:8080/healthz/live?verbose=1"

# Check readiness with details
kubectl exec -n meshery deployment/meshery -- curl -s "http://localhost:8080/healthz/ready?verbose=1"

Example verbose output:

[+]capabilities ok
[i]extension extension package found
healthz check passed

Where:

[+] indicates a passing health check
[-] indicates a failing health check
[i] indicates informational status (doesn't affect health)

Custom Probe Configuration

For specific deployment scenarios, you can customize probe settings:

High Availability Setup

yaml

probe:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 90
    periodSeconds: 10
    failureThreshold: 3
    timeoutSeconds: 5

  readinessProbe:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 3
    failureThreshold: 3
    successThreshold: 1
    timeoutSeconds: 3

Resource-Constrained Environments

yaml

probe:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 150  # More time for slower startup
    periodSeconds: 20
    failureThreshold: 5

  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 6

Startup Probes (Kubernetes 1.18+)

For better handling of slow-starting containers, consider adding a startup probe:

yaml

# Add to deployment.yaml template
{{- if .Values.probe.startupProbe.enabled }}
startupProbe:
  httpGet:
    path: /healthz/live
    port: http
  initialDelaySeconds: {{ .Values.probe.startupProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.probe.startupProbe.periodSeconds }}
  failureThreshold: {{ .Values.probe.startupProbe.failureThreshold }}
{{- end }}

And in values.yaml:

yaml

probe:
  startupProbe:
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 10
    failureThreshold: 30  # 30 * 10s = 5 minutes max startup time

Monitoring and Troubleshooting

Common Issues

Pods stuck in CrashLoopBackOff
- Check if initialDelaySeconds is too short
- Verify provider capabilities are being loaded
- Use verbose mode to identify which check is failing
Pods not becoming ready
- Check readiness probe logs with verbose mode
- Ensure provider configuration is correct
- Verify network connectivity to provider endpoints
Frequent restarts
- Increase failureThreshold to tolerate temporary issues
- Increase periodSeconds to reduce probe frequency
- Check for resource constraints (CPU/memory)

Debug Commands

bash

# Check pod status
kubectl get pods -n meshery

# View pod events
kubectl describe pod -n meshery <pod-name>

# Check health endpoint directly
kubectl port-forward -n meshery deployment/meshery 8080:8080
curl http://localhost:8080/healthz/ready?verbose=1

# View container logs
kubectl logs -n meshery deployment/meshery -f

Migration from Previous Versions

If upgrading from a version without enhanced health checks:

The endpoints maintain backward compatibility
Default probe configurations will work without changes
Consider adjusting timing parameters based on your environment
Test the upgrade in a staging environment first

Best Practices

Always enable both probes - Liveness and readiness serve different purposes
Set appropriate delays - Account for provider initialization time
Use startup probes - For Kubernetes 1.18+ to handle initial startup separately
Monitor probe metrics - Use Kubernetes events and metrics to tune settings
Test in staging - Validate probe settings before production deployment
Document custom settings - If you modify defaults, document why

Meshery Health Check Configuration

Meshery Health Check Configuration

Overview

Health Check Behavior

Liveness Probe (/healthz/live)

Readiness Probe (/healthz/ready)

Default Configuration

Installation Considerations

Initial Installation

Upgrade Considerations

Advanced Configuration

Verbose Mode for Debugging

Custom Probe Configuration

High Availability Setup

Resource-Constrained Environments

Startup Probes (Kubernetes 1.18+)

Monitoring and Troubleshooting

Common Issues

Debug Commands

Migration from Previous Versions

Best Practices

References

Liveness Probe (`/healthz/live`)

Readiness Probe (`/healthz/ready`)