rfd/0083-machine-id-host-certs.md
Allow Machine ID bots to issue host certificates for OpenSSH servers, per our documented manual instructions.
This has two components:
Issuing host certs today requires someone with admin-level permissions to
manually issue certificates via tctl auth sign .... If these certificates
expire or if the cluster CAs are rotated, these certificates need to be manually
refreshed. This is inconvenient and encourages users to distribute long-lived
credentials when onboarding OpenSSH nodes.
In Teleport today, host certificates are not modelled as a real resource.
However, we do expose a host_cert RBAC resource. Any user granted the create
verb on this (virtual) resource may call GenerateHostCert() with any
parameters they wish and receive a set of host certificates.
This is reasonable for human cluster administrators with short-lived credentials protected by MFA, but problematic for long-lived machine credentials. Stolen credentials, even if short-lived, could be used to issue additional long-lived certificates. While these host credentials don't provide much access to the Teleport cluster, they could be used to impersonate real nodes.
To mitigate this, we'd like to implement support for where clauses for
host_cert RBAC rules as we do for rules backed by real Teleport resources. For
example, this role would allow a user to issue host certs, but only for
principals ending in foo.example.com:
kind: role
version: v5
metadata:
  name: example
spec:
  allow:
    rules:
      - resources: [host_cert]
        verbs: [create]
        where: all_end_with(host_cert.principals, ".foo.example.com")
This involves:
- Adding new comparison functions to better match against DNS names. Minimally,
  we'll add all_end_with(inputs, suffix) and all_equal(inputs, value), but
  may consider additional matchers in the future if there's demand. Methods
  like regexes may have problematic quoting and escaping requirements that may
  not be worth solving in the underlying predicate library, so we'll avoid
  implementing these until a meaningful use-case exists.

- Ensuring these functions support lists of strings (or building separate
  functions specifically for matching against []string).

  For example, a hypothetical all_end_with(host_cert.principals, ".example.com")
  should AND the result for each input principal. Luckily this does not require
  any changes to the underlying library, and we can confine this behavior to
  only new functions to avoid impact to existing rules.

- Evaluating where clauses in GenerateHostCert(). As these are not
  proper Teleport resources, the predicate context doesn't include any
  useful values to compare against. We'll want to add a new optional
  host_cert field much like SSHSession, then provide a custom context in
  GenerateHostCert().

  Additional context fields will be passed to the predicate parser to allow
  further restriction of certificate issuance, including the role, cluster
  name, host ID, and node name. While present, the TTL field may require
  additional custom predicate functions to compare durations. Most of these
  fields (other than principals) are normally unset for SSH nodes.
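
The AND-over-every-principal semantics described above can be sketched in plain Go. This is only an illustration: the names allEndWith and allEqual are hypothetical, the way such functions would be registered with the predicate library is not shown, and rejecting an empty input list is an assumption made here so that a request with no principals cannot satisfy a rule vacuously.

```go
package main

import (
	"fmt"
	"strings"
)

// allEndWith reports whether every input ends with suffix.
// Assumption: an empty list never matches.
func allEndWith(inputs []string, suffix string) bool {
	if len(inputs) == 0 {
		return false
	}
	for _, in := range inputs {
		if !strings.HasSuffix(in, suffix) {
			return false
		}
	}
	return true
}

// allEqual reports whether every input equals value.
// Assumption: an empty list never matches.
func allEqual(inputs []string, value string) bool {
	if len(inputs) == 0 {
		return false
	}
	for _, in := range inputs {
		if in != value {
			return false
		}
	}
	return true
}

func main() {
	ok := []string{"a.foo.example.com", "b.foo.example.com"}
	bad := []string{"a.foo.example.com", "evil.example.org"}

	fmt.Println(allEndWith(ok, ".foo.example.com"))  // true
	fmt.Println(allEndWith(bad, ".foo.example.com")) // false
}
```

Note that the suffix in the example includes the leading dot, so a principal such as "evilfoo.example.com" does not match ".foo.example.com".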
We should provide (and document) a sane rule that ensures regular SSH host cert rules are followed, alongside users' own requirements (e.g. DNS suffix).
SSH host certificates issued today via tctl auth sign do not have an
expiration date. While we could preserve this behavior, any process that
continually produces certificates with an infinite TTL seems problematic.
Our preference is to generate similarly short-lived certificates as we do for other bot credentials; this has certain (minor) caveats as explored in the UX section below.
We have several options for issuing host certificates. Factors to consider include:
Can sshd gracefully reload these certificates? We found that sshd re-reads
the certificate file for each incoming connection, so no reload is necessary
unless the sshd_config file itself has changed (e.g. changed the path to
the host cert / key).

Config Templates in Machine ID are refreshed on each iteration of the bot's usual renewal loop (20 minutes by default). This would be the most straightforward integration approach, and we'd handle it as we do other special cert types for apps / databases / k8s clusters.
Users would request a host cert via the config file:
destinations:
  - directory: /opt/machine-id
    configs:
      - host_cert:
          principals: [foo.example.com]
While rendering the config template, the bot would call GenerateHostCert(),
format the resulting certificate and key appropriately, and write it to disk.
A new certificate would be written at startup, then approximately every 20
minutes while running normally.
Per testing, OpenSSH re-reads certificate files on demand, so no additional work
is needed to make use of these certificates once paths are configured.
Additionally, sshd does play nice with tbot init's ACL implementation, and
does not care about file permissions unless they're owned by root.
A downside to this approach is that permissions errors (for example, if the
where predicate doesn't match) will be reported as errors but will not crash
the bot.
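
The failure mode described above can be sketched as follows. This is not tbot's actual code: configTemplate, renderOnce, and errDenied are hypothetical names used only to show one renewal iteration that reports a failed template and carries on with the rest.

```go
package main

import (
	"errors"
	"fmt"
)

// errDenied stands in for an RBAC failure, e.g. a where clause that
// does not match the requested principals (hypothetical).
var errDenied = errors.New("access denied")

// configTemplate is a hypothetical stand-in for one config template.
type configTemplate struct {
	name   string
	render func() error
}

// renderOnce renders every template for a single renewal iteration.
// Failures are collected and returned for logging rather than aborting,
// so one failing template neither stops the others nor ends the process.
func renderOnce(templates []configTemplate) []error {
	var errs []error
	for _, t := range templates {
		if err := t.render(); err != nil {
			errs = append(errs, fmt.Errorf("rendering %s: %w", t.name, err))
		}
	}
	return errs
}

func main() {
	templates := []configTemplate{
		{name: "host_cert", render: func() error { return errDenied }},
		{name: "identity", render: func() error { return nil }},
	}
	for _, err := range renderOnce(templates) {
		fmt.Println("warning:", err)
	}
}
```

With this shape, a misconfigured role shows up only in the bot's logs, so the existing host certificate on disk keeps ageing until someone notices.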
sshd caveat with short-lived certificates

When sshd host certificates expire (because tbot crashed, for instance),
users will see an error like the following:
Certificate invalid: expired
The authenticity of host '192.168.122.6.foo.example.com (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is SHA256:CWqUJ7q3uPGX9gMoD7R76Hi8pJsoSL8SA0J1FIMmOc8.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])?
This is nearly identical to the usual ssh TOFU message, save for the
easy-to-miss "Certificate invalid: expired" message. Users are likely
conditioned to accept this, and if that happens the expired or invalid host key
will be committed to their known_hosts permanently, after which the "expired"
message will not be shown again.
We'll need to document this caveat along with a workaround (e.g. an
ssh-keygen -R command to remove the old entry) to help users avoid connecting
to potentially untrusted hosts.
Users can already use bot identities to generate host certs with the host_cert
permission granted, by passing the Machine ID identity file to tctl:
$ tctl -i path/to/identity auth sign \
--host=foo.example.com \
--format=openssh \
--out=myhost
This UX works fine in tandem with secure certificate issuance, but users still need to reissue certs when CAs are rotated or after a compromise. Users could automate this with cron, systemd timers, or similar.
If we deem that these certs should be renewed at a different interval than other bot resources, we could add a new renewal loop specifically for these certificates. It would also be triggered on CA rotation, as regular renewals are today.
Open questions: bots still renew all certs at startup. Do we maintain that behavior for these? If not, how do we know when to renew certs next?
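
A dedicated renewal loop of this kind might look like the sketch below. Everything here is hypothetical: renewLoop, the rotations channel, and the renew callback are illustrative names, and how the bot actually learns of a CA rotation is not modelled. The sketch renews once at startup purely as an assumption; whether to keep that behavior is the open question above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewLoop calls renew once at startup (an assumption), then on every
// tick of interval and whenever a value arrives on rotations. It returns
// when ctx is cancelled.
func renewLoop(ctx context.Context, interval time.Duration, rotations <-chan struct{}, renew func(reason string)) {
	renew("startup")

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			renew("interval")
		case <-rotations:
			renew("ca rotation")
		}
	}
}

// runDemo drives the loop with a long interval and one simulated CA
// rotation, returning the reasons for each renewal in order.
func runDemo() []string {
	ctx, cancel := context.WithCancel(context.Background())
	rotations := make(chan struct{})
	reasons := make(chan string)
	done := make(chan struct{})

	go func() {
		defer close(done)
		renewLoop(ctx, time.Hour, rotations, func(reason string) { reasons <- reason })
	}()

	got := []string{<-reasons}   // renewal at startup
	rotations <- struct{}{}      // simulate a CA rotation
	got = append(got, <-reasons) // renewal triggered by the rotation

	cancel()
	<-done
	return got
}

func main() {
	fmt.Println(runDemo()) // [startup ca rotation]
}
```

Keeping the interval as a parameter lets host certificates use a different cadence from the 20 minute default without changing how rotations are handled.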
GenerateHostCert() calls