RFC 9460 - 2021-12-01 - Health endpoint improvements

Our existing /health endpoint is limited, and additional information should be exposed for both operators and load balancers/service discovery systems.

Context

Scope

In scope

  • Update health endpoint to improve Vector's reliability in production deployments
  • Update default configurations to improve Vector's out-of-the-box reliability
  • Improve visibility of high level status at runtime

Out of scope

  • Immediate integrations to expose component level health

Pain

  • Today's health endpoint causes 502s when running behind a load balancer
  • The healthcheck isn't integrated with any components and only represents whether Vector itself is running
  • Operators don't have visibility into the state of specific components, e.g. whether backpressure is being applied to upstream components

Proposal

User Experience

The existing health endpoint on Vector's API is updated to return 503s when Vector is shutting down. This additional response will allow load balancers and service discovery systems to better integrate with Vector: sources already begin rejecting requests received during the shutdown process, and updating the health check allows the instance to be removed from the set of available backends, avoiding routing traffic to an instance that would end up rejecting the request regardless.

The API will not start until the topology is built and validated as functional. This isn't precisely when Vector's configured sources can actually process events, but it's a reasonable first step to ensure graceful startups.

Implementation

We have existing logic in place for other components to handle Vector's shutdown. The health handler can be updated to respond 503 if shutdown has started and 200 otherwise. This should not negatively impact existing deployments.
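A minimal sketch of that check, assuming a shutdown flag (the `SHUTTING_DOWN` name and bare-function shape here are illustrative, not Vector's actual internals) that the existing shutdown logic would set before the topology winds down:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag flipped by the existing shutdown sequence.
static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

/// Health handler sketch: 200 while running, 503 once shutdown begins.
fn health_status() -> u16 {
    if SHUTTING_DOWN.load(Ordering::Relaxed) {
        503
    } else {
        200
    }
}

fn main() {
    assert_eq!(health_status(), 200);
    SHUTTING_DOWN.store(true, Ordering::Relaxed);
    assert_eq!(health_status(), 503);
}
```

Because the flag is a single atomic read, the handler stays cheap enough to be polled frequently by load balancers.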

Rationale

This improvement is low effort but greatly improves the experience of running Vector behind a load balancer or in a service discovery system. As we recommend both of those options for production deployments, there's very little reason not to implement this.

Drawbacks

N/A

Prior Art

Alternatives

Do nothing

Not adding additional responses to the existing health endpoint seems strictly worse, as we currently have no way to stop routing traffic to instances that are shutting down. Those requests will already be rejected by sources, and this behaviour reflects poorly on zero-downtime deployments, even though events shouldn't be dropped.

vector health subcommand

While not suitable for general load balancer usage, it's very easy to exec commands in Kubernetes to determine health. For prior art we don't have to look further than the Datadog Agent, which has a subcommand that outputs current health information for use by operators and systems.
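A hedged sketch of what such a subcommand's core might look like, with a human-readable status on stdout/stderr and the exit code carrying the machine-readable result (all names here are assumptions for illustration, not Vector's actual CLI; a real subcommand would query the API rather than take a flag):

```rust
use std::process::exit;

/// Hypothetical core of a `vector health` subcommand: print a
/// human-readable status and return an exit code that systems
/// (e.g. a Kubernetes exec probe) can act on.
fn run_health_subcommand(shutting_down: bool) -> i32 {
    if shutting_down {
        eprintln!("vector: shutting down");
        1 // unhealthy: take out of rotation
    } else {
        println!("vector: healthy");
        0 // healthy
    }
}

fn main() {
    // In the real subcommand this state would come from the running instance.
    let code = run_health_subcommand(false);
    exit(code);
}
```

Exit-code semantics are what make this usable as a Kubernetes exec probe: the kubelet only inspects whether the command exited 0.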

Outstanding Questions

  • Do we need to update the { health } GraphQL query at the same time?
  • Should we assign a unique error code for "shutting down"?

Plan Of Attack

  • Integrate the health endpoint with our shutdown sequence, having the API return a 503 and take the shutting-down instance out of load balancing/service discovery
  • Verify the /health endpoint is unavailable until Vector's topology is built and valid to run

Future Improvements

  • Expand Vector's concept of "health" to a component level, and define a spec for both the /health endpoint as well as component "health"
  • Add routes/optional params to health endpoint to query the health of specific components
  • Add a "tiered" health status (Green/Yellow/Red) to better represent the "distributed" nature of Vector's runtime (Elasticsearch health endpoints as an example)
  • Add healthchecks for sources (and transforms), to better determine Vector's ability to receive and process events
  • Update sinks (and other components) to regularly rerun their configured healthchecks
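As an illustration of the tiered idea above, a minimal sketch of how component statuses might roll up into a single Green/Yellow/Red value (the names and rollup rule are assumptions for this RFC, not an implemented API):

```rust
/// Hypothetical tiered health value, modeled loosely on
/// Elasticsearch's green/yellow/red cluster health.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Green,  // all components healthy
    Yellow, // degraded, e.g. backpressure on a source
    Red,    // at least one component failing
}

/// Roll individual component statuses up to an instance-level status:
/// any Red wins, otherwise any Yellow, otherwise Green.
fn aggregate(components: &[Health]) -> Health {
    if components.contains(&Health::Red) {
        Health::Red
    } else if components.contains(&Health::Yellow) {
        Health::Yellow
    } else {
        Health::Green
    }
}

fn main() {
    assert_eq!(aggregate(&[Health::Green, Health::Green]), Health::Green);
    assert_eq!(aggregate(&[Health::Green, Health::Yellow]), Health::Yellow);
    assert_eq!(aggregate(&[Health::Yellow, Health::Red]), Health::Red);
}
```

A "worst component wins" rollup like this keeps the instance-level signal conservative, which is usually what load balancers and operators want.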