rfcs/2021-12-01-9460-health-endpoint-improvements.md
Our existing /health endpoint is limited and additional information should be
exposed for both operators and load balancers/service discovery.
- `/health` endpoint
- `/health` endpoint

The existing health endpoint on Vector's API is updated to return 503s when Vector is shutting down. This additional response will allow load balancers and service discovery to better integrate with Vector. Sources already begin rejecting requests received during the shutdown process, so updating the health check allows the instance to be removed from the available backends and avoids routing traffic to an instance that would reject the request regardless.
The API will not start until the topology is built and validated as functional. This isn't precisely when Vector's configured sources can actually process events, but it's a reasonable first step to ensure graceful startups.
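
To make the intended effect on routing concrete, here is a minimal sketch of how a load balancer or service discovery system might interpret a probe of the health endpoint. `ProbeResult` and `should_route` are hypothetical names used only for illustration; they are not part of Vector or of any particular load balancer, and real systems usually add thresholds and retries on top of this.

```rust
/// What a health probe might observe (hypothetical model, not Vector code).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeResult {
    /// The API is not listening yet, e.g. the topology is still being built.
    Unreachable,
    /// The API answered with this HTTP status code.
    Status(u16),
}

/// Decide whether a backend should receive traffic.
/// Only a 200 response counts as healthy; an unreachable API (startup)
/// and a 503 response (shutdown) both take the instance out of rotation.
pub fn should_route(probe: ProbeResult) -> bool {
    matches!(probe, ProbeResult::Status(200))
}
```

Under this model an instance only receives traffic between the moment the API starts and the moment shutdown begins.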
We have existing logic in place for other components to handle Vector's shutdown. The health handler can be updated to respond 503 if shutdown has started and 200 otherwise. This should not negatively impact existing deployments.
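
The handler logic described above can be sketched as follows. This is an illustration under stated assumptions, not Vector's actual implementation: `ShutdownFlag` and `health_status` are hypothetical names, and the existing shutdown signalling in Vector may look quite different from a single atomic flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical shared flag that is set once shutdown has started.
pub struct ShutdownFlag(AtomicBool);

impl ShutdownFlag {
    /// Create a flag in the "not shutting down" state.
    pub const fn new() -> Self {
        Self(AtomicBool::new(false))
    }

    /// Record that shutdown has started.
    pub fn begin_shutdown(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// Report whether shutdown has started.
    pub fn is_shutting_down(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// HTTP status code the health handler would respond with:
/// 503 once shutdown has started, 200 otherwise.
pub fn health_status(flag: &ShutdownFlag) -> u16 {
    if flag.is_shutting_down() {
        503
    } else {
        200
    }
}
```

Because the status is derived from a flag that is only ever flipped once, existing deployments that only check for a 200 keep working until shutdown actually begins.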
This change is low effort but greatly improves the experience of running Vector behind a load balancer or in a service discovery system. Because we recommend both of those options for production deployments, there is very little reason not to implement it.
N/A
Not adding additional responses to the existing health endpoint seems strictly worse, as we would have no way to stop routing traffic to instances that are shutting down. Those requests will already be rejected by sources, and this behaviour reflects poorly on zero-downtime deployments, even though events shouldn't be dropped.
`vector health` subcommand

While not suitable for general load balancer usage, it's very easy to exec
commands in Kubernetes to determine health. For prior art we don't have to look
further than the Datadog Agent which has a subcommand that outputs current
health information that can be used by operators/systems.
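
As a rough illustration of why a subcommand suits exec-style checks: Kubernetes exec probes treat exit code 0 as success and any other exit code as failure, so such a command mostly needs to translate the health status into an exit code. The sketch below assumes a hypothetical helper, `exit_code_for`, that receives the status obtained from the health endpoint (or `None` if the endpoint could not be reached). It does not describe the Datadog Agent's subcommand or any existing Vector command.

```rust
/// Map the observed health status to a process exit code.
/// `None` means the health endpoint could not be reached at all.
/// (Hypothetical helper for illustration only.)
pub fn exit_code_for(status: Option<u16>) -> i32 {
    match status {
        // Healthy: exit code 0 marks the check as successful.
        Some(200) => 0,
        // Shutting down (503), any other status, or unreachable:
        // a non-zero exit code marks the check as failed.
        _ => 1,
    }
}
```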
- Update the `{ health }` GraphQL query at the same time?
- Return `503` and take the shutting down instance out of load balancing/service discovery
- `/health` endpoint is unavailable until Vector's topology is built and valid to run
- `/health` endpoint as well as component "health"
- `sources` (and transforms), to better determine Vector's ability to receive and process events