design/20220118.certificate-issuance-exponential-backoff.md
Implement a way to apply exponential backoff when retrying failed certificate
issuances. Issuance in this context refers to the period of time during which
the Issuing condition on a Certificate is set to true and for which a new
set of issuance-specific resources (CertificateRequests, Orders, Challenges etc) is
created.
Currently, failed issuances are retried once an hour with no backoff or time limit. This means that (1) continuous failures in large installations can overwhelm external services, and (2) rate limits can easily be hit when issuance problems persist (see the Let's Encrypt rate limits).
Ensure that retrying failed issuances does not overwhelm external services and is less likely to hit rate limits by adding exponentially increasing delays between issuance retries
Ensure that when the backoff is being applied, users have a way to find out when the issuance will be next retried and that the backoff mechanism does not introduce extra complexity for debugging and fits in with the already existing 1 hour backoffs
Ensure that cmctl renew can still be used to manually force an immediate issuance attempt, so that when the issuance was failing due to a setup error (e.g. DNS setup) and a user believes that they have fixed it, they have a way to verify the fix without waiting up to 32h
The following are explicitly out of scope (non-goals):
Introduce a backoff period that is shorter than the current static 1 hour backoff period so that issuance of short-lived certificates can be retried sooner. That is a separate concern from backing off exponentially and is complex enough to be worked on separately
Make the backoff period configurable, as this would add a lot of extra complexity. For context, the Kubernetes pod CrashLoopBackOff period is not configurable (although it is a much-requested feature, see k/k#57291)
Make it possible to reset the backoff period (however, it would be possible to force issuance to be retried immediately using cmctl renew and, if that succeeded, the backoff would be reset)
Cause all Certificates whose issuances are currently failing to be re-issued at once after cert-manager controller restart
Cause all Certificates whose issuances are currently failing to be re-issued at once after upgrading to a cert-manager version that implements exponential backoff
Exponential backoff will be implemented by exponentially increasing the delays between a failed issuance (Issuing condition set to false in certificates-issuing controller) and a new issuance (Issuing condition set to true in certificates-trigger controller). From a user perspective, this will correspond to the delay between a CertificateRequest having failed and new CertificateRequests being created.
A new IssuanceAttempts status field will be added to Certificate that will be used to record the number of consecutive failed issuances.
Similarly to status.LastFailureTime, status.IssuanceAttempts field will only be set for a Certificate whose issuance is currently failing and will be removed after a successful issuance.
IssuanceAttempts will be set by certificates-issuing controller after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, certificates-issuing controller will ensure that status.IssuanceAttempts is not set.
The delay until the next issuance will then be calculated by the certificates-trigger controller using the formula: if status.LastFailureTime != nil then next_issuance_attempt_time = status.LastFailureTime + time.Hour x 2 ^ (status.IssuanceAttempts - 1). This is a binary exponential backoff, so the sequence of delays will be 1h, 2h, 4h, 8h, 16h, 32h. The first delay is therefore 1 hour from the last failure time, which is the current behaviour. In case of continuous failures, the delay keeps increasing up to a maximum backoff period of 32h, after which issuance is retried every 32h whilst the failures persist.
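The formula above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the controller code: the helper name backoffDelay and the constants initialBackoff and maxBackoff are hypothetical names chosen for this example, and treating a count below 1 as 1 is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// initialBackoff is the delay after the first failure (the current 1 hour behaviour).
	initialBackoff = time.Hour
	// maxBackoff is the 32h cap described in the proposal.
	maxBackoff = 32 * time.Hour
)

// backoffDelay is a hypothetical helper that returns
// initialBackoff * 2^(attempts-1), capped at maxBackoff.
// Values of attempts below 1 are treated as 1 (an assumption of this sketch).
func backoffDelay(attempts int) time.Duration {
	if attempts < 1 {
		attempts = 1
	}
	delay := initialBackoff
	for i := 1; i < attempts; i++ {
		delay *= 2
		// Return as soon as the cap is reached so large counts cannot overflow.
		if delay >= maxBackoff {
			return maxBackoff
		}
	}
	return delay
}

func main() {
	// Print the delay that would follow each consecutive failure.
	for attempts := 1; attempts <= 8; attempts++ {
		fmt.Printf("attempts=%d delay=%s\n", attempts, backoffDelay(attempts))
	}
}
```

The next attempt time would then be status.LastFailureTime plus the returned delay. Doubling in a loop with an early return, rather than shifting by attempts-1, keeps the result at the cap even for very large counts.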
A new IssuanceAttempts field will be added to Certificate's status.
type CertificateStatus struct {
	// EXISTING FIELDS
	// ...

	// NEW FIELDS

	// IssuanceAttempts represents the number of consecutive failed issuances.
	// This field is used to calculate the backoff period after which issuance will be attempted again.
	IssuanceAttempts int `json:"issuanceAttempts,omitempty"`
}
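As a rough illustration of how the certificates-issuing controller might maintain this field, the sketch below increments the counter on failure and clears it on success. The type certificateStatus, the helpers recordFailure and recordSuccess, and the fixed exampleFailureTime are all hypothetical and exist only for this example; they are not cert-manager APIs.

```go
package main

import (
	"fmt"
	"time"
)

// certificateStatus is a trimmed-down, hypothetical stand-in for the
// Certificate status, holding only the two fields relevant to backoff.
type certificateStatus struct {
	LastFailureTime  *time.Time
	IssuanceAttempts int
}

// exampleFailureTime is an arbitrary fixed timestamp used by the example.
var exampleFailureTime = time.Date(2022, time.January, 18, 12, 0, 0, 0, time.UTC)

// recordFailure bumps the counter (0 becomes 1 on the first failure)
// and records when the failure happened.
func recordFailure(s *certificateStatus, failedAt time.Time) {
	s.IssuanceAttempts++
	s.LastFailureTime = &failedAt
}

// recordSuccess clears both fields, so that with omitempty the
// counter disappears from the serialized status.
func recordSuccess(s *certificateStatus) {
	s.IssuanceAttempts = 0
	s.LastFailureTime = nil
}

func main() {
	status := &certificateStatus{}

	recordFailure(status, exampleFailureTime)
	recordFailure(status, exampleFailureTime.Add(time.Hour))
	fmt.Println("attempts after two failures:", status.IssuanceAttempts)

	recordSuccess(status)
	fmt.Println("attempts after success:", status.IssuanceAttempts)
	fmt.Println("last failure cleared:", status.LastFailureTime == nil)
}
```

With a plain int and omitempty, a zero value and an unset field are indistinguishable once serialized, which is sufficient here because a count of zero never needs to be reported.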
A large part of these examples shows what is already the current behaviour. The only changes are the parts where the IssuanceAttempts field is managed and where the delay between issuances is calculated with an exponential backoff.
A CertificateRequest fails. This is the 3rd failed issuance in a row
certificates-issuing controller reconciles the failed CertificateRequest, bumps the status.IssuanceAttempts by 1 as well as updating the status.LastFailureTime to the time when CertificateRequest failed and setting the Issuing condition to false (in failIssueCertificate)
certificates-trigger controller parses the Certificate with the false Issuing condition, calculates the backoff period (in this case it will be status.LastFailureTime + 1h x 2 ^ (3 - 1), so roughly in 4 hours) in shouldBackoffReissuingOnFailure and enqueues the Certificate to be reconciled in 4 hours (c.scheduleRecheckOfCertificateIfRequired)
In 4 hours, Certificate gets reconciled again and certificates-trigger controller sets the Issuing condition to true. This time the CertificateRequest succeeds.
certificates-issuing controller reconciles the Certificate and the succeeded CertificateRequest and removes the status.IssuanceAttempts as well as status.LastFailureTime and Issuing condition
certificates-trigger controller determines that backoff is not needed and re-queues Certificate to be renewed based on status.RenewalTime
A CertificateRequest fails. This is the 3rd failed issuance in a row
certificates-issuing controller reconciles the failed CertificateRequest, bumps the status.IssuanceAttempts by 1 as well as updating the status.LastFailureTime to the time when CertificateRequest failed and setting the Issuing condition to false
certificates-trigger controller parses the Certificate with the false Issuing condition, calculates the backoff period (in this case it will be status.LastFailureTime + 1h x 2 ^ (3 - 1), so roughly in 4 hours) in shouldBackoffReissuingOnFailure and enqueues the Certificate to be reconciled in 4 hours (c.scheduleRecheckOfCertificateIfRequired)
User fixes the reason for failure (e.g. some networking setup) and runs cmctl renew <certificate-name> to force immediate re-issuance, which sets the Issuing condition to true on the Certificate, thus signalling to the other controllers that issuance is in progress and bypassing the certificates-trigger controller's check for whether a backoff is needed
A new CertificateRequest is created and succeeds
certificates-issuing controller reconciles the Certificate and the succeeded CertificateRequest and removes the status.IssuanceAttempts as well as status.LastFailureTime and Issuing condition
certificates-trigger controller parses the Certificate, determines that backoff is not needed and requeues Certificate to be renewed based on status.RenewalTime
A CertificateRequest fails. This is the 3rd failed issuance in a row.
certificates-issuing controller reconciles the failed CertificateRequest, bumps the status.IssuanceAttempts by 1 as well as updating the status.LastFailureTime to the time when CertificateRequest failed and setting the Issuing condition to false
certificates-trigger controller parses the Certificate with the false Issuing condition, calculates the backoff period (in this case it will be status.LastFailureTime + 1h x 2 ^ (3 - 1), so roughly in 4 hours) in shouldBackoffReissuingOnFailure and enqueues the Certificate to be reconciled in 4 hours (c.scheduleRecheckOfCertificateIfRequired)
User thinks that they have fixed the failure (e.g. some networking setup) and runs cmctl renew <certificate-name> to force immediate re-issuance, which sets the Issuing condition to true on the Certificate, thus signalling to the other controllers that issuance is in progress and bypassing the certificates-trigger controller's check for whether a backoff is needed
A new CertificateRequest is created and fails again
certificates-issuing controller reconciles the Certificate and the failed CertificateRequest, bumps status.IssuanceAttempts to 4, sets the Issuing condition to false and sets status.LastFailureTime to now
certificates-trigger controller parses the Certificate with the false Issuing condition, calculates the backoff period (in this case it will be status.LastFailureTime + 1h x 2 ^ (4 - 1), so roughly in 8 hours) in shouldBackoffReissuingOnFailure and enqueues the Certificate to be reconciled in 8 hours (c.scheduleRecheckOfCertificateIfRequired)
(These examples are based on what the statuses already look like after a failed/succeeded issuance. The only change is the issuanceAttempts field)
Certificate where issuance has failed 3 times in a row:
Status:
  Conditions:
  - LastTransitionTime: <timestamp>
    Message: <message>
    ObservedGeneration: 1
    Reason: Ready
    Status: "True"
    Type: Ready
  - LastTransitionTime: <timestamp>
    Message: <message>
    ObservedGeneration: 1
    Reason: Failed
    Status: "False"
    Type: Issuing # Issuing condition remains set, but false after a failed issuance
  NotAfter: <timestamp>
  NotBefore: <timestamp>
  RenewalTime: <timestamp>
  IssuanceAttempts: 3
  LastFailureTime: <timestamp> # Last failed issuance (i.e. when a `CertificateRequest` failed)
  Revision: 19
Events:
Certificate where the latest issuance succeeded and no issuances are being attempted now:
Status:
  Conditions:
  - LastTransitionTime: <timestamp>
    Message: <message>
    ObservedGeneration: 1
    Reason: Ready
    Status: "True"
    Type: Ready
  NotAfter: <timestamp>
  NotBefore: <timestamp>
  RenewalTime: <timestamp>
  Revision: 20
Events:
The example flows described in Examples and Upgrading will be tested via integration tests (similar to the current integration tests for certificates)
Upgrading to a cert-manager version that implements exponential backoff, or downgrading to one that does not, should not require any extra steps or cause unnecessary re-issuances.
To ensure that Certificates whose issuance is currently failing don't get renewed all at once after upgrading to a cert-manager version that implements exponential backoff, the certificates-trigger controller should fall back to a 1 hour delay for all Certificates that have status.LastFailureTime set but don't have status.IssuanceAttempts set.
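The upgrade fallback could look roughly like the sketch below. It is only an illustration under assumptions: retryDelay, maxBackoff and exampleLastFailure are hypothetical names, and an unset IssuanceAttempts is represented by the zero value of an int.

```go
package main

import (
	"fmt"
	"time"
)

// maxBackoff is the 32h cap described in the proposal.
const maxBackoff = 32 * time.Hour

// exampleLastFailure is an arbitrary fixed timestamp used by the example.
var exampleLastFailure = time.Date(2022, time.January, 18, 12, 0, 0, 0, time.UTC)

// retryDelay is a hypothetical helper returning how long to wait after
// lastFailure before the next issuance attempt.
func retryDelay(lastFailure *time.Time, attempts int) time.Duration {
	if lastFailure == nil {
		// No recorded failure: no backoff is needed.
		return 0
	}
	if attempts < 1 {
		// LastFailureTime was written by a version that did not record
		// IssuanceAttempts: keep the pre-upgrade 1 hour delay.
		return time.Hour
	}
	delay := time.Hour
	// Double once per additional failure, stopping once the cap is reached.
	for i := 1; i < attempts && delay < maxBackoff; i++ {
		delay *= 2
	}
	return delay
}

func main() {
	fmt.Println("no failure recorded:", retryDelay(nil, 0))
	fmt.Println("failed before upgrade:", retryDelay(&exampleLastFailure, 0))
	fmt.Println("third failure after upgrade:", retryDelay(&exampleLastFailure, 3))
}
```

Because the fallback equals the delay for a first failure, a Certificate that was already failing before the upgrade continues on the same 1 hour schedule until its next failure is counted.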
Although work on this was prompted by wanting to limit calls to ACME servers, users of other types of issuers will benefit from it
1h is an acceptable initial delay between issuances (keeping this to 1h ensures that at least for the first retry attempt, we keep the current behaviour, however perhaps for short lived certs it would be useful to start with a shorter initial delay?)
32h is an acceptable maximum delay between issuances
In case of exponential backoff being applied, controller logs will be sufficient for users trying to debug this and understand when the next issuance will be attempted
Issuances that fail due to a denied CertificateRequest should not be treated differently from other failures (so exponential backoff should be applied). Currently they are treated the same and retried after 1 hour, so this is consistent with the existing behaviour
The current assumption is that exponential backoff would not be placed behind a feature gate, however this should be considered.
Some reasons for putting it behind a feature gate:
If the feature gate were to be implemented, it would mean adding a new flag to the controller binary (e.g. --enable-exponential-backoff) and adding some if-statements to the certificates-issuing and certificates-trigger controllers so that they neither set nor parse the status.IssuanceAttempts field unless the feature gate is enabled. (Question: would IssuanceAttempts still be added to the v1 Certificate API?)