# RFD 0223 - Kubernetes Health Checks
Enable automated Kubernetes cluster health checks that can be viewed from the Web UI, Teleport Connect, the tctl command, and Prometheus metrics.
Proxied Kubernetes cluster health can be exercised manually by performing Kubernetes operations in the Web UI or with the kubectl command and observing the result. While effective, manual checking is slow and does not scale.
Automated Kubernetes health checks with new methods of viewing improve maintainability of Teleport Kubernetes clusters, while enabling new scenarios.
Automated health checks:
Alice enrolls three Amazon EKS clusters into Teleport through the Web UI.
The next day she returns to the Web UI Resources tab to find Amazon EKS clusters are highlighted with warnings. She clicks on an EKS tile and a health message is displayed in a side panel.
```
Kubernetes Cluster Issues

3 Teleport Kubernetes clusters report issues.

Affected Teleport Kubernetes cluster:
- Hostname: sol
  UUID: 52dedbd0-b165-4bf6-9bc3-961f95bf481d
  Error: Unable to retrieve pods from the Kubernetes cluster. Please see the Kubernetes Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting/.

Affected Teleport Kubernetes cluster:
- Hostname: jupiter
  UUID: bb4dc171-ffa7-4a31-ba8c-7bf91c59e250
  Error: Unable to retrieve pods from the Kubernetes cluster. Please see the Kubernetes Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting/.

Affected Teleport Kubernetes cluster:
- Hostname: saturn
  UUID: 2be08cb1-56a4-401f-a3f3-c755a73f3ff6
  Error: Unable to retrieve pods from the Kubernetes cluster. Please see the Kubernetes Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting/.
```
Alice notices that each cluster has similar access denied errors.
Alice applies new Kubernetes RBAC, and clusters return to a healthy state.
As she monitors the Teleport Web UI, she sees each Amazon EKS tile switch from a warning state to a normal state.
### tctl - Configuring a New Health Check

Bob reads about Kubernetes health checks in a Teleport changelog, and updates a Teleport cluster to the new major version.
Bob runs `tctl get health_check_config/default` from a terminal to view the default health settings.
```yaml
version: v1
metadata:
  name: "default"
  labels:
    teleport.internal/resource-type: preset
spec:
  match:
    db_labels:
      - name: "*"
        values:
          - "*"
    kubernetes_labels:
      - name: "*"
        values:
          - "*"
```
He notices a new `kubernetes_labels` matcher.
He exercises the Kubernetes health checks in non-production environments.
Bob runs `tctl edit health_check_config/default` from a terminal, updating the default settings to exclude Kubernetes health checks from the production environment.
```yaml
version: v1
metadata:
  name: "default"
  labels:
    teleport.internal/resource-type: preset
spec:
  match:
    db_labels:
      - name: "*"
        values:
          - "*"
    kubernetes_labels:
      - name: "*"
        values:
          - "*"
    kubernetes_labels_expression: "labels.env != `prod`"
```
Bob runs `tctl get kube_server/luna` from a terminal, confirming that health monitoring is active for the expected Kubernetes cluster.
```yaml
kind: kube_server
metadata:
  expires: "2025-10-26T00:00:00.000000Z"
  name: luna
  revision: 43e96231-faaf-43c3-b9b8-15cf91813389
spec:
  host_id: 278be63c-c87e-4d7e-a286-86002c7c45c3
  hostname: luna
status:
  target_health:
    addr: luna:3027
    protocol: TLS
    transition_timestamp: "2025-10-25T00:00:00.000000Z"
    transition_reason: "healthy threshold reached"
    status: healthy
  version: 19.0.0
version: v3
```
Charlie relies on Prometheus to notify him of outages that require action.
He reads that Kubernetes cluster health is available through Prometheus metrics.
Charlie tests the feature.
He enrolls three GKE clusters into Teleport.
He runs Prometheus query expressions to check the new health metrics.
```promql
teleport_resources_health_status_unhealthy{type="kubernetes"}
# Returns 0, the number of unhealthy Kubernetes clusters

teleport_resources_health_status_healthy{type="kubernetes"} +
teleport_resources_health_status_unhealthy{type="kubernetes"} +
teleport_resources_health_status_unknown{type="kubernetes"}
# Returns 3, the total number of Kubernetes clusters
```
Charlie puts one GKE cluster into an unhealthy state and runs the queries again.
```promql
teleport_resources_health_status_unhealthy{type="kubernetes"}
# Returns 1, the number of unhealthy Kubernetes clusters

teleport_resources_health_status_healthy{type="kubernetes"} +
teleport_resources_health_status_unhealthy{type="kubernetes"} +
teleport_resources_health_status_unknown{type="kubernetes"}
# Returns 3, the total number of Kubernetes clusters
```
With the metrics returning expected values, he sets up a Prometheus alerting rule.
```yaml
groups:
  - name: teleport_kubernetes
    rules:
      - alert: KubernetesClusterUnhealthy
        expr: |
          teleport_resources_health_status_unhealthy{type="kubernetes"} > 0
        for: 5m
        labels:
          severity: warning
          team: platform
          component: teleport
          service: kubernetes
        annotations:
          summary: "{{ $value }} Kubernetes cluster(s) unhealthy in Teleport"
          description: "Teleport reports {{ $value }} unhealthy Kubernetes cluster(s). Kubernetes clusters registered with Teleport are failing health checks. Check the Teleport Web UI or use tctl get kube_server for details."
          query: 'teleport_resources_health_status_unhealthy{type="kubernetes"} > 0'
```
Prometheus alerts him about the unhealthy Kubernetes cluster.
Charlie restores the GKE cluster to a healthy state and moves on with his day.
Kubernetes health checks are discussed below by functional area: core logic, the tctl command, the Web UI, and Prometheus metrics.
Teleport Kubernetes health checks use the Teleport `healthcheck` package, and are based on the existing database health check design patterns. The healthcheck components are already written, tested, and in production. The effort for Kubernetes focuses on integrating health checks into the Kubernetes agent, extending existing healthcheck mechanisms, and updating the UI.
A first step to enabling Kubernetes health checks is adding new matchers to the HealthCheckConfig service. HealthCheckConfig identifies servers which choose to participate in health checking.
Matchers `kubernetes_labels` and `kubernetes_labels_expression` are added to select labeled Kubernetes clusters. By default, the preset enables all Kubernetes clusters to participate in health checks. Manually specifying matchers can restrict health checks to specific Kubernetes clusters. Editing HealthCheckConfig to omit Kubernetes matchers excludes all Kubernetes clusters from health checks.
An example health_check_config:
```yaml
version: v1
metadata:
  name: "default"
  labels:
    teleport.internal/resource-type: preset
spec:
  match:
    kubernetes_labels:
      - name: "*"
        values:
          - "*"
    kubernetes_labels_expression: "labels.env != `prod`"
    db_labels:
      - name: "*"
        values:
          - "*"
    db_labels_expression: "labels.env != `prod`"
```
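The matching rule — label matchers and the predicate expression are logically ANDed when both are set — can be sketched with a toy wildcard matcher. This is illustrative only; `matchLabels` and `shouldCheckHealth` are hypothetical names, not Teleport APIs:

```go
package main

import "fmt"

// contains reports whether values includes v.
func contains(values []string, v string) bool {
	for _, a := range values {
		if a == v {
			return true
		}
	}
	return false
}

// matchLabels is an illustrative sketch of wildcard label matching: the
// matcher name "*" with value "*" matches any resource, and a "*" value
// matches any value for that label name. All matcher entries must match.
func matchLabels(matcher map[string][]string, labels map[string]string) bool {
	for name, values := range matcher {
		if name == "*" {
			if contains(values, "*") {
				continue
			}
			return false
		}
		v, ok := labels[name]
		if !ok || (!contains(values, "*") && !contains(values, v)) {
			return false
		}
	}
	return true
}

// shouldCheckHealth ANDs the label matcher result with the (separately
// evaluated) kubernetes_labels_expression result; an unset expression is
// ignored, mirroring the "logically ANDed if both are non-empty" rule.
func shouldCheckHealth(labelsMatch bool, exprSet, exprMatch bool) bool {
	return labelsMatch && (!exprSet || exprMatch)
}

func main() {
	wild := map[string][]string{"*": {"*"}}
	prodOnly := map[string][]string{"env": {"prod"}}
	labels := map[string]string{"env": "staging"}

	fmt.Println(matchLabels(wild, labels))            // true
	fmt.Println(matchLabels(prodOnly, labels))        // false
	fmt.Println(shouldCheckHealth(true, true, false)) // false: expression excludes it
}
```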
HealthCheckConfig may be communicated via proxy and is cached on a Kubernetes agent. Kubernetes Go interfaces are updated to support the proxy communication and caching. Pre-existing mechanisms for configuring HealthCheckConfig with interval, timeout, and healthy/unhealthy thresholds are described in the database health check RFD.
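For illustration, the pre-existing knobs might be configured alongside the new matchers as follows. The interval/timeout/threshold field names follow the database health check configuration; treat the values as examples:

```yaml
kind: health_check_config
version: v1
metadata:
  name: default
spec:
  match:
    kubernetes_labels:
      - name: "*"
        values:
          - "*"
  # Pre-existing timing knobs, shared with database health checks.
  interval: 30s          # time between checks
  timeout: 5s            # per-check timeout
  healthy_threshold: 2   # consecutive passes before marking healthy
  unhealthy_threshold: 1 # consecutive failures before marking unhealthy
```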
The Kubernetes agent registers one or more Kubernetes clusters, checks the health of proxied Kubernetes clusters, and communicates a health state back to the auth server. The agent adds a `healthcheck.Manager`, which registers Kubernetes clusters and schedules health checks.
### healthcheck Package

The `healthcheck` package performs recurring health checks on one or more Teleport resources: databases, Kubernetes clusters, etc. It is a general-purpose library that currently supports TCP checks; extending it to support Kubernetes API calls is a focus of this work.
`healthcheck.Target` gains a `CheckHealth` function field and loses the `ResolverFn` function field. The existing `ResolverFn`-based logic for database health checks is encapsulated in a new database-specific `CheckHealth` function, which is then assigned to the `Target.CheckHealth` field.
```diff
 // Target is a health check target.
 type Target struct {
 	// GetResource gets a copy of the target resource with updated labels.
 	GetResource func() types.ResourceWithLabels
-	// ResolverFn resolves the target endpoint(s).
-	ResolverFn EndpointsResolverFunc
+	// CheckHealth checks the health of the target resource.
+	CheckHealth func(ctx context.Context) error
 }
```
A healthcheck worker calls the new `CheckHealth` function.
Prometheus gauge metrics are also added to the healthcheck package, and described in the Prometheus Implementation.
Kubernetes cluster health is detected by calling the Kubernetes API SelfSubjectAccessReview endpoint over TLS.
Four API calls are made per health check against the endpoint `/apis/authorization.k8s.io/v1/selfsubjectaccessreviews`.
The API calls exercise Teleport Kubernetes RBAC. Positive responses indicate the Kubernetes cluster is properly configured with a Teleport ClusterRole and responds to requests.
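For concreteness, a minimal sketch of the request body each check POSTs to that endpoint. These are local illustrative types mirroring the Kubernetes API shape, not the client-go library:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceAttributes is a minimal local mirror of the Kubernetes
// authorization ResourceAttributes fields used here.
type resourceAttributes struct {
	Verb     string `json:"verb"`
	Resource string `json:"resource"`
}

// selfSubjectAccessReview mirrors the request body posted to
// /apis/authorization.k8s.io/v1/selfsubjectaccessreviews. Field names
// follow the Kubernetes API; this is an illustrative shape only.
type selfSubjectAccessReview struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Spec       struct {
		ResourceAttributes resourceAttributes `json:"resourceAttributes"`
	} `json:"spec"`
}

// newReview builds a review asking "can I perform verb on resource?".
func newReview(verb, resource string) selfSubjectAccessReview {
	r := selfSubjectAccessReview{
		APIVersion: "authorization.k8s.io/v1",
		Kind:       "SelfSubjectAccessReview",
	}
	r.Spec.ResourceAttributes = resourceAttributes{Verb: verb, Resource: resource}
	return r
}

func main() {
	// Example: ask whether the agent's identity can list pods.
	body, _ := json.Marshal(newReview("list", "pods"))
	fmt.Println(string(body))
}
```

A positive `allowed` field in the response indicates the agent's Teleport identity holds the corresponding RBAC permission.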
A healthy Kubernetes cluster is defined as one that a customer can actually use. Exercising the /selfsubjectaccessreviews endpoint checks RBAC in addition to the other layers of Kubernetes functionality, addressing usability from the customer's point of view.
TCP checks and the Kubernetes health endpoints readyz / livez / healthz were explored. Each indicates some level of Kubernetes cluster health, but none ensures that a customer can actually use the cluster.
Kubernetes offers an API with several health check endpoints, and TCP checks are also available.
| Approach | Description |
|---|---|
| /readyz | Ready to accept API requests |
| /readyz?verbose | Ready to accept API requests (detailed) |
| /livez | kube-apiserver process is alive/running |
| /livez?verbose | kube-apiserver process is alive/running (detailed) |
| /healthz | Ambiguously alive or ready. Deprecated in 2019 at v1.16 |
| TCP | Can establish TCP connection to API server port |
Let's explore the options.
/readyz means that the cluster is accepting API requests, and can be used.
/livez indicates the Kubernetes kube-apiserver process is alive. API requests may or may not be accepted. There's no implication of whole cluster readiness.
/healthz is deprecated, and not supported with the Kubernetes health check feature. /healthz was deprecated in September of 2019 with Kubernetes v1.16. At the time of writing, Kubernetes is at v1.33, and six years have passed since v1.16, so it seems reasonable not to support the /healthz endpoint. This choice requires customers to use Kubernetes v1.16 or later with Teleport Kubernetes health checks.
Moving on to TCP: a TCP check indicates only that network connectivity is available, and reveals nothing further about Kubernetes health. In scenarios where servers don't offer explicit health checks, such as databases, TCP may be the only option. Since Kubernetes offers health endpoints, TCP checks can be skipped.
So, /livez and TCP indicate some level of health, but do not necessarily mean the Kubernetes cluster can be used.
Let's look at the verbose query parameter.
/readyz?verbose provides a list of Kubernetes modules with ok / not ok states. The verbose information is not critical in the common case of a healthy cluster returning a 200 HTTP status code. The verbose information may be helpful to an administrator diagnosing an unhealthy cluster.
For efficiency in the common case of a healthy cluster, the /readyz endpoint could be called and checked for a 200 status code. In nearly all cases we would only need to check 200, and the verbose body message would not be sent, reducing unneeded network, memory, and processor consumption. Also, the Kubernetes authors recommend relying on the status code for checking state.
In the case of non-200 response codes, a follow-up call to /readyz?verbose could be made. The follow-up verbose message may be appended to a Go error, and eventually forwarded to the Web UI for a Teleport administrator to view.
An example /readyz?verbose response body for a 503 Service Unavailable HTTP status code:
```
[+]ping ok
[+]log ok
[-]etcd not ok: client: etcd cluster is unavailable or misconfigured: context deadline exceeded
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
...
[+]shutdown ok
readyz check failed
```
Calling /readyz with a fallback to /readyz?verbose provides the healthy / unhealthy states, with diagnostics when needed. However, it would not indicate that a customer has properly configured Teleport Kubernetes RBAC, nor that the customer can actually use a Kubernetes cluster.
Further alternatives considered:

- /readyz -> /readyz?verbose -> /livez?verbose -> TCP. If each call returned a non-200 response, the next fallback could be tried. This would capture as much information as is available for the Teleport administrator. The approach is not selected, as it's seen as over-engineering for minimal return.
- kube-state-metrics. A downed node doesn't necessarily imply an unhealthy cluster; the cluster may be at reduced capacity, or possibly non-functional. The complexity grows, and may best be addressed by observability solutions.

> [!NOTE]
> Node health != Kubernetes cluster health

> [!NOTE]
> Pod health != Kubernetes cluster health
### tctl Implementation

Planned changes to HealthCheckConfig percolate to tctl; no further tctl-specific changes are needed.
Previous planning and implementation work from database health checks makes displaying Kubernetes health checks straightforward. No new visual or coding design patterns are necessary; the Kubernetes health check UI implementation has well-defined, surgical insertion points.
A TargetHealth property and kube_cluster if/switch logic are added in approximately nine files.
The Teleport Connect UI is implemented at the same time as the Web UI. Teleport Connect shares UI components, such as UnifiedResource, making the implementation closely related.
User-friendly error messages are displayed with a link to the Kubernetes Access Troubleshooting guide. The guide will be updated with each error message and its resolution steps.
A Kubernetes cluster may be in a healthy, unhealthy, or unknown state.
- `healthy` indicates the Kubernetes cluster may be used by a customer
- `unhealthy` indicates the Kubernetes cluster is in an error state, and includes an error message
- `unknown` indicates the Kubernetes cluster cannot be contacted

See the database health check RFD for more details.
Three Prometheus metrics are added to the healthcheck package:
- `teleport_resources_health_status_healthy` for the number of healthy resources
- `teleport_resources_health_status_unhealthy` for the number of unhealthy resources
- `teleport_resources_health_status_unknown` for the number of resources in an unknown health state

The metrics are designed to observe resource health and support multiple resource types (databases, Kubernetes, etc.), while keeping the quantity of Prometheus metrics to a minimum. Applying a Prometheus label `type="db|kubernetes|etc"` to a metric distinguishes one resource type from another.
Use a PromQL expression to determine the total number of Kubernetes clusters.
```promql
teleport_resources_health_status_healthy{type="kubernetes"} +
teleport_resources_health_status_unhealthy{type="kubernetes"} +
teleport_resources_health_status_unknown{type="kubernetes"}
```
Use a PromQL expression to detect the presence of unhealthy Kubernetes clusters.
```promql
teleport_resources_health_status_unhealthy{type="kubernetes"} > 0
```
When an unhealthy Kubernetes cluster is detected, a Teleport administrator may use the Teleport Web UI, tctl command, or kubectl command to identify the Kubernetes cluster, and diagnose further.
The metrics are implemented as gauges to enable incrementing and decrementing (a counter only increments), and as a Vec to enable multiple resource types, db|kubernetes|etc.
Here are example metric definitions:
```go
// teleport_resources_health_status_healthy
resourceHealthyGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: teleport.MetricNamespace,
		Subsystem: teleport.MetricResourcesHealthStatus,
		Name:      teleport.MetricHealthy,
		Help:      "Number of healthy resources",
	},
	[]string{teleport.TagType}, // db|k8s|etc
)

// teleport_resources_health_status_unhealthy
resourceUnhealthyGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: teleport.MetricNamespace,
		Subsystem: teleport.MetricResourcesHealthStatus,
		Name:      teleport.MetricUnhealthy,
		Help:      "Number of unhealthy resources",
	},
	[]string{teleport.TagType}, // db|k8s|etc
)

// teleport_resources_health_status_unknown
resourceUnknownGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: teleport.MetricNamespace,
		Subsystem: teleport.MetricResourcesHealthStatus,
		Name:      teleport.MetricUnknown,
		Help:      "Number of resources in an unknown health state",
	},
	[]string{teleport.TagType}, // db|k8s|etc
)
```
Health check metrics are incremented and decremented in `worker.go` during state changes, and decremented on close. Each metric is labeled with a resource type such as `db` or `kubernetes`.
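The transition bookkeeping can be sketched as follows — a stdlib-only simulation standing in for the Prometheus GaugeVecs, with hypothetical names, not the worker's actual code:

```go
package main

import "fmt"

// gauges simulates the three per-status GaugeVecs: status -> resource type -> count.
type gauges map[string]map[string]int

func newGauges() gauges {
	return gauges{"healthy": {}, "unhealthy": {}, "unknown": {}}
}

// transition moves one resource of the given type from oldStatus to newStatus,
// decrementing the old gauge and incrementing the new one, as the worker would
// on a health state change. An empty oldStatus means the resource is newly
// registered; an empty newStatus means the worker is closing.
func (g gauges) transition(resourceType, oldStatus, newStatus string) {
	if oldStatus != "" {
		g[oldStatus][resourceType]--
	}
	if newStatus != "" {
		g[newStatus][resourceType]++
	}
}

func main() {
	g := newGauges()
	g.transition("kubernetes", "", "unknown")        // registered, not yet checked
	g.transition("kubernetes", "unknown", "healthy") // healthy threshold reached
	g.transition("kubernetes", "healthy", "unhealthy")
	fmt.Println(g["unhealthy"]["kubernetes"]) // 1
	g.transition("kubernetes", "unhealthy", "") // worker closed: decrement only
	fmt.Println(g["unhealthy"]["kubernetes"]) // 0
}
```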
Database health checks, which use the existing healthcheck package, now emit health metrics. Future components using the healthcheck package, such as MWI, would also emit health metrics.
Alternatives not chosen:

- Metrics defined outside the `healthcheck` package would enable Kubernetes-specific metric naming without a `type` label. Metric names `teleport_kubernetes_enrolled` and `teleport_kubernetes_available` may be more evident on first reading, and PromQL expressions would be slightly simpler, e.g. `teleport_kubernetes_enrolled - teleport_kubernetes_available`. Existing Teleport Kubernetes metrics consistently use the prefix `teleport_kubernetes_*`. The main drawback of this approach is less reusability across the Teleport agents that check health: each agent would need to write its own health check metrics. The `healthcheck` package is written for reusability across agents, and adding Prometheus metrics to it is a well-factored design at the cost of more verbose metric names.
- Within the `healthcheck` package metrics addition, metric naming might be tailored to an existing naming pattern of `teleport_<resource>_enrolled` / `teleport_<resource>_available`. A pattern of using type predicate metrics is also present, for `teleport_connected_resources{type="resource"}` and `teleport_reverse_tunnels_connected{type="resource"}`. There are multiple valid approaches used in Teleport. The implementation of a type predicate is simpler with a single Prometheus GaugeVec; `teleport_<resource>_enrolled` would use a map of Prometheus Gauges, essentially duplicating GaugeVec. This is a neutral design choice.

Also see the Prometheus docs for metric and label naming practices, and the "cardinality is key" blog post exploring the topic.
In HA deployment scenarios, proxy routing to healthy Kubernetes clusters will be considered. Database health checks already provide proxy routing based on healthy database connections.
The existing HealthCheckConfig supports Terraform operations, and will be extended with Kubernetes matchers.
User documentation will be updated with Kubernetes health checks, similar to database health checks.
The documentation will point out that Prometheus metrics only track configured health checks. It's possible for Kubernetes clusters to exist but be skipped by health check monitoring, and therefore not be visible in Prometheus.
Kubernetes health check documentation is modeled on the existing Database Health Checks documentation.
The Kubernetes Access Troubleshooting guide will be updated with the user-friendly errors returned by health checks, and related resolution steps.
Health check calls are made with TLS between Kubernetes agents and proxied Kubernetes clusters.
The existing health_check_config configuration is extended, and the existing RBAC security applies.
Users who are authorized to run tctl get kube_server can see health info, which was already guarded by RBAC.
See the database health checks RFD for more details on health_check_config.
Several existing protobufs are extended and one new message is added.
Changes focus on adding TargetHealth and label matchers.
Definitions apply modern protobuf naming conventions, and omit deprecated gogoproto tags.
`legacy/types/types.proto`:

```diff
 // KubernetesServerV3 represents a Kubernetes server.
 message KubernetesServerV3 {
   option (gogoproto.goproto_stringer) = false;
   option (gogoproto.stringer) = false;

   // Kind is the Kubernetes server resource kind. Always "kube_server".
   string Kind = 1 [(gogoproto.jsontag) = "kind"];
   // SubKind is an optional resource subkind.
   string SubKind = 2 [(gogoproto.jsontag) = "sub_kind,omitempty"];
   // Version is the resource version.
   string Version = 3 [(gogoproto.jsontag) = "version"];
   // Metadata is the Kubernetes server metadata.
   Metadata Metadata = 4 [
     (gogoproto.nullable) = false,
     (gogoproto.jsontag) = "metadata"
   ];
   // Spec is the Kubernetes server spec.
   KubernetesServerSpecV3 Spec = 5 [
     (gogoproto.nullable) = false,
     (gogoproto.jsontag) = "spec"
   ];
+  // Status is the Kubernetes server status.
+  KubernetesServerStatusV3 status = 6;
 }

+// KubernetesServerStatusV3 is the Kubernetes cluster status.
+message KubernetesServerStatusV3 {
+  // TargetHealth is the health status of network connectivity between
+  // the agent and the Kubernetes cluster.
+  TargetHealth target_health = 1;
+}
```
`healthcheckconfig/v1/health_check_config.proto`:

```diff
 // Matcher is a resource matcher for health check config.
 message Matcher {
   // DBLabels matches database labels. An empty value is ignored. The match
   // result is logically ANDed with DBLabelsExpression, if both are non-empty.
   repeated teleport.label.v1.Label db_labels = 1;
   // DBLabelsExpression is a label predicate expression to match databases. An
   // empty value is ignored. The match result is logically ANDed with DBLabels,
   // if both are non-empty.
   string db_labels_expression = 2;
+  // KubernetesLabels matches kubernetes labels. An empty value is ignored. The match
+  // result is logically ANDed with KubernetesLabelsExpression, if both are non-empty.
+  repeated teleport.label.v1.Label kubernetes_labels = 3;
+  // KubernetesLabelsExpression is a label predicate expression to match kubernetes. An
+  // empty value is ignored. The match result is logically ANDed with KubernetesLabels,
+  // if both are non-empty.
+  string kubernetes_labels_expression = 4;
 }
```
`lib/teleterm/v1/kube.proto`:

```diff
 // Kube describes connected Kubernetes cluster
 message Kube {
   // uri is the kube resource URI
   string uri = 1;
   // name is the kube name
   string name = 2;
   // labels is the kube labels
   repeated Label labels = 3;
+  // target_health is the health of the kube cluster
+  TargetHealth target_health = 4;
 }
```
Kubernetes health checks are backported to v18.
The healthcheck package used by Kubernetes health checks was introduced in v18, and is unsupported in v16 and v17.
Backward compatibility for adding Kubernetes label matchers to v18 was tested and verified to function properly.
Testing was performed on a development machine.
A Teleport v19 proof-of-concept was run with auth+proxy+kube. The health check config was edited to include only Kubernetes label matchers. Storage of the edited health check config in the backend events table was double-checked by viewing the /health_check_config/default key data with DB Browser for SQLite. Only the Kubernetes wildcard matchers were present (no db matchers).
Teleport v18 auth+proxy+kube was run with the identical configuration and backend database. v18 runs without issue; Kubernetes health checks are simply omitted on v18 without the backport.
Validation of health check config is performed only on writes to the backend database with ValidateHealthCheckConfig, and not during reads of the config.
Customers would update to v19 or a v18 with a backport to participate in Kubernetes health checks.
No new audit events.
Existing health_check_config Create/Update/Delete events are exercised.
Three Prometheus metrics are implemented in the healthcheck package, and described in the Prometheus Implementation.
Log messages are emitted when helpful.
A Kubernetes health check test plan closely mirrors the database health check plan.
The following steps are added:
### Kubernetes Health Checks
- [ ] Verify health checks with `tctl`
- [ ] `tctl get kube_server` includes `kube_server.status.target_health` info
- [ ] `tctl update health_check_config` resets `kube_server.status.target_health.status` with matching Kubernetes clusters. This may take several minutes.
  - [ ] Disabling health checks shows `kube_server.status.target_health` as "unknown/disabled". This may take several minutes. There are a couple of ways to achieve this.
- [ ] `tctl update health_check_config`
- [ ] `tctl delete health_check_config`
- [ ] Verify health checks with the web UI
- [ ] Configure a Kubernetes agent with a Kubernetes cluster with an unreachable endpoint.
- [ ] The web UI resource page shows a warning indicator for that Kubernetes cluster with error details.
- [ ] Without restarting the Kubernetes agent, make the Kubernetes cluster endpoint reachable. Observe that the warning indicator disappears after some time.
- [ ] Exercise agents/proxies running an older version of Teleport in a mixed fleet scenario.
- [ ] Exercise only Kubernetes matchers (no db matchers)
- [ ] Exercise Kubernetes matchers and database matchers
- [ ] Exercise zero matchers
Implementation starts with foundational elements in core health checks, and continues to UI and documentation.
Prometheus metrics are added to the healthcheck package. This is a straightforward addition with few dependencies.
Protobufs form a foundation and are well-defined.
Integrate health checks into the Kubernetes agent and proxy.
Health checks are performed by the Teleport Kubernetes agent, which makes calls to Kubernetes clusters.
Health checks are reported to the Teleport auth server.
Kubernetes health checks are configured and viewable from tctl.
Kubernetes health checks are displayed and updated in the Web UI.
Kubernetes health checks are displayed and updated in the Teleport Connect UI.
Add user documentation.