RFC 9460 - 2021-12-01 - Health endpoint improvements

Our existing /health endpoint is limited, and additional information should be exposed for both operators and load balancers/service discovery systems.

Context

Scope

In scope

  • Update health endpoint to improve Vector's reliability in production deployments
  • Update default configurations to improve Vector's out-of-the-box reliability
  • Improve visibility of high level status at runtime

Out of scope

  • Immediate integrations to expose component level health

Pain

  • Today's health endpoint causes 502s when running behind a load balancer
  • The healthcheck isn't integrated with any components and only represents whether Vector itself is running
  • Operators don't have visibility into the state of specific components, e.g. whether backpressure is being applied to upstream components

Proposal

User Experience

The existing health endpoint on Vector's API is updated to return 503s when Vector is shutting down. This additional response will allow load balancers and service discovery systems to better integrate with Vector: sources already begin rejecting requests received during the shutdown process, and updating the health check allows the instance to be removed from the set of available backends, avoiding routing traffic to an instance that would end up rejecting the request regardless.

The API will not start until the topology is built and validated as functional. This isn't precisely when Vector's configured sources can actually process events, but it's a reasonable first step to ensure graceful startups.

Implementation

We have existing logic in place for other components to handle Vector's shutdown. The health handler can be updated to respond 503 if shutdown has started and 200 otherwise. This should not negatively impact existing deployments.
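A minimal sketch of that check, assuming a shutdown flag (the `SHUTTING_DOWN` name and bare-function shape here are illustrative, not Vector's actual internals) that the existing shutdown logic would set before the topology winds down:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag flipped by the existing shutdown sequence.
static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

/// Health handler sketch: 200 while running, 503 once shutdown begins.
fn health_status() -> u16 {
    if SHUTTING_DOWN.load(Ordering::Relaxed) {
        503
    } else {
        200
    }
}

fn main() {
    assert_eq!(health_status(), 200);
    SHUTTING_DOWN.store(true, Ordering::Relaxed);
    assert_eq!(health_status(), 503);
}
```

Because the flag is a single atomic read, the handler stays cheap enough to be polled frequently by load balancers.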

Rationale

This improvement is low effort but greatly improves the experience of running Vector behind a load balancer or in a service discovery system. As we recommend both of those options for production deployments, there's very little reason not to implement this.

Drawbacks

N/A

Prior Art

Alternatives

Do nothing

Not adding additional responses to the existing health endpoint seems strictly worse, as we currently have no way to stop routing traffic to instances that are shutting down. Those requests will already be rejected by sources, and this behaviour reflects poorly on zero-downtime deployments, even though events shouldn't be dropped.

vector health subcommand

While not suitable for general load balancer usage, it's very easy to exec commands in Kubernetes to determine health. For prior art we don't have to look further than the Datadog Agent, which has a subcommand that outputs current health information for use by operators and systems.
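A hedged sketch of what such a subcommand's core might look like, with a human-readable status on stdout/stderr and the exit code carrying the machine-readable result (all names here are assumptions for illustration, not Vector's actual CLI; a real subcommand would query the API rather than take a flag):

```rust
use std::process::exit;

/// Hypothetical core of a `vector health` subcommand: print a
/// human-readable status and return an exit code that systems
/// (e.g. a Kubernetes exec probe) can act on.
fn run_health_subcommand(shutting_down: bool) -> i32 {
    if shutting_down {
        eprintln!("vector: shutting down");
        1 // unhealthy: take out of rotation
    } else {
        println!("vector: healthy");
        0 // healthy
    }
}

fn main() {
    // In the real subcommand this state would come from the running instance.
    let code = run_health_subcommand(false);
    exit(code);
}
```

Exit-code semantics are what make this usable as a Kubernetes exec probe: the kubelet only inspects whether the command exited 0.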

Outstanding Questions

  • Do we need to update the { health } GraphQL query at the same time?
  • Should we assign a unique error code for "shutting down"?

Plan Of Attack

  • Integrate the health endpoint with our shutdown sequence, having the API return a 503 and take the shutting-down instance out of load balancing/service discovery
  • Verify the /health endpoint is unavailable until Vector's topology is built and valid to run

Future Improvements

  • Expand Vector's concept of "health" to a component level, and define a spec for both the /health endpoint as well as component "health"
  • Add routes/optional params to health endpoint to query the health of specific components
  • Add a "tiered" health status (Green/Yellow/Red) to better represent the "distributed" nature of Vector's runtime (Elasticsearch health endpoints as an example)
  • Add healthchecks for sources (and transforms), to better determine Vector's ability to receive and process events
  • Update sinks (and other components) to regularly rerun their configured healthchecks
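As an illustration of the tiered idea above, a minimal sketch of how component statuses might roll up into a single Green/Yellow/Red value (the names and rollup rule are assumptions for this RFC, not an implemented API):

```rust
/// Hypothetical tiered health value, modeled loosely on
/// Elasticsearch's green/yellow/red cluster health.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Green,  // all components healthy
    Yellow, // degraded, e.g. backpressure on a source
    Red,    // at least one component failing
}

/// Roll individual component statuses up to an instance-level status:
/// any Red wins, otherwise any Yellow, otherwise Green.
fn aggregate(components: &[Health]) -> Health {
    if components.contains(&Health::Red) {
        Health::Red
    } else if components.contains(&Health::Yellow) {
        Health::Yellow
    } else {
        Health::Green
    }
}

fn main() {
    assert_eq!(aggregate(&[Health::Green, Health::Green]), Health::Green);
    assert_eq!(aggregate(&[Health::Green, Health::Yellow]), Health::Yellow);
    assert_eq!(aggregate(&[Health::Yellow, Health::Red]), Health::Red);
}
```

A "worst component wins" rollup like this keeps the instance-level signal conservative, which is usually what load balancers and operators want.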