design/retryon-design.md
RetryPolicy
Status: Draft
When a retryPolicy is specified in an HTTPProxy, Envoy is configured to retry all 5xx status codes.
This proposal seeks to add fields to the RetryPolicy that give the user the ability to specify the conditions under which a retry should occur (e.g. certain status codes).
When Kubernetes registers and deregisters Pods to receive network traffic, it takes Contour a non-zero amount of time (less than a second) to update the upstreams in Envoy. For services with a high number of requests per second (RPS), this can result in a handful of 503s during this period.
At present, users can specify a retryPolicy in their HTTPProxy that will retry all 5xx status codes.
When configured properly (i.e. with an appropriate number of retries and time between retries) this mitigates the delay of Contour updating Envoy upstreams.
However, retrying all 5xx responses is heavy-handed when one only wants to retry upstream connection errors.
One could simply add an option to the retryPolicy that tells Envoy to retry upstream connection errors, but this seems too specific to the problem described above.
One could also expose all the conditional options of retries in Envoy, but this seems too broad a solution, may complicate the HTTPProxy spec too much, and may enable users to do more than is reasonable.
Thus, this proposal suggests a way to expose the conditions for retrying a request in a general but limited fashion. It is designed to be applicable to the specific problem above while also meeting the needs of others who may wish to control the conditions of their retries. It also tries to avoid exposing too broad a set of configuration options for retries.
HTTPProxy manifest
This proposal would add two new fields to the retryPolicy -- retryOn and retriableStatusCodes.
These fields would map to the retry_on and retriable_status_codes fields of the Envoy v2 route.RetryPolicy, respectively.
Below is an example YAML of what this would look like, as an extension of the existing example of a retryPolicy from the HTTPProxy reference:
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: response-timeout
  namespace: default
spec:
  virtualhost:
    fqdn: timeout.bar.com
  routes:
  - timeoutPolicy:
      response: 1s
      idle: 10s
    retryPolicy:
      count: 3
      perTryTimeout: 150ms
      retryOn:
      - connect-failure
      retriableStatusCodes:
      - 503
      - 504
    services:
    - name: s1
      port: 80
In the above example, a request would be retried on the following conditions:
Two new fields will be added to the v1.RetryPolicy of the HTTPProxy spec:
RetryOn []RetryOn
Optional
Slice of x-envoy-retry-on conditions
RetriableStatusCodes []uint32
Optional
Slice of HTTP status codes
The RetryOn field uses a string type alias with kubebuilder validation.
This ensures that the HTTPProxy only contains valid values for the v1.RetryPolicy.RetryOn field:
// +kubebuilder:validation:Enum=5xx;gateway-error;reset;connect-failure;retriable-4xx;refused-stream;retriable-status-codes;retriable-headers
type RetryOn string
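As a rough illustration of the fields described above, the following sketch shows how the extended policy type might look in Go. The struct is a trimmed-down stand-in rather than the actual v1.RetryPolicy: the names and JSON tags of the existing fields are assumptions, and `invalidRetryOn` is a hypothetical helper that mirrors in plain Go what the kubebuilder Enum marker enforces at admission time.

```go
package main

import "fmt"

// RetryOn is the string alias proposed above. In the CRD, allowed values
// are enforced by the kubebuilder Enum marker, not by Go code.
type RetryOn string

// RetryPolicy is an illustrative stand-in for v1.RetryPolicy.
// The first two fields and all JSON tags are assumptions for this sketch.
type RetryPolicy struct {
	NumRetries           int64     `json:"count"`
	PerTryTimeout        string    `json:"perTryTimeout,omitempty"`
	RetryOn              []RetryOn `json:"retryOn,omitempty"`
	RetriableStatusCodes []uint32  `json:"retriableStatusCodes,omitempty"`
}

// allowedRetryOn lists the values from the Enum marker in this proposal.
var allowedRetryOn = map[RetryOn]bool{
	"5xx":                    true,
	"gateway-error":          true,
	"reset":                  true,
	"connect-failure":        true,
	"retriable-4xx":          true,
	"refused-stream":         true,
	"retriable-status-codes": true,
	"retriable-headers":      true,
}

// invalidRetryOn (hypothetical helper) returns the conditions in p that
// are not part of the enum, in the order they appear.
func invalidRetryOn(p RetryPolicy) []RetryOn {
	var bad []RetryOn
	for _, c := range p.RetryOn {
		if !allowedRetryOn[c] {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	p := RetryPolicy{
		NumRetries:           3,
		PerTryTimeout:        "150ms",
		RetryOn:              []RetryOn{"connect-failure", "sometimes"},
		RetriableStatusCodes: []uint32{503, 504},
	}
	fmt.Println(invalidRetryOn(p)) // prints [sometimes]
}
```

Both new fields carry `omitempty` in this sketch so that existing manifests, which omit them, round-trip unchanged.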
One new field will be added to the dag.RetryPolicy:
RetriableStatusCodes []uint32
Slice of HTTP status codes
New logic will be introduced when transforming a v1.RetryPolicy into a dag.RetryPolicy:
If v1.RetryPolicy.RetryOn is nil or empty, it will be set to the default value of []string{"5xx"} for backwards compatibility
v1.RetryPolicy.RetryOn will be joined into a string, with values separated by commas, and set as the value of dag.RetryPolicy.RetryOn
v1.RetryPolicy.RetriableStatusCodes will be set as the value of dag.RetryPolicy.RetriableStatusCodes
New logic will be introduced when transforming a dag.RetryPolicy into an Envoy v2 route.RetryPolicy:
dag.RetryPolicy.RetryOn will be set as the value of route.RetryPolicy.RetryOn
dag.RetryPolicy.RetriableStatusCodes will be set as the value of route.RetryPolicy.RetriableStatusCodes
No further changes should be necessary to propagate these new configurable fields to Envoy.
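The two transformation steps described above can be sketched roughly as follows. All three struct types are simplified stand-ins (assumptions for this example) for v1.RetryPolicy, dag.RetryPolicy, and the Envoy route.RetryPolicy, and the function names `toDAG` and `toEnvoy` are hypothetical; only the defaulting, joining, and copying behavior is taken from this proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// RetryOn is the string alias proposed for the HTTPProxy spec.
type RetryOn string

// Simplified stand-ins for the three policy types; only the fields
// relevant to this proposal are shown.
type v1RetryPolicy struct {
	RetryOn              []RetryOn
	RetriableStatusCodes []uint32
}

type dagRetryPolicy struct {
	RetryOn              string
	RetriableStatusCodes []uint32
}

type envoyRetryPolicy struct {
	RetryOn              string
	RetriableStatusCodes []uint32
}

// toDAG applies the backwards-compatible default ("5xx") when no
// conditions are given, then joins the conditions with commas.
func toDAG(p v1RetryPolicy) dagRetryPolicy {
	conds := p.RetryOn
	if len(conds) == 0 {
		conds = []RetryOn{"5xx"}
	}
	parts := make([]string, 0, len(conds))
	for _, c := range conds {
		parts = append(parts, string(c))
	}
	return dagRetryPolicy{
		RetryOn:              strings.Join(parts, ","),
		RetriableStatusCodes: p.RetriableStatusCodes,
	}
}

// toEnvoy copies both fields across unchanged.
func toEnvoy(p dagRetryPolicy) envoyRetryPolicy {
	return envoyRetryPolicy{
		RetryOn:              p.RetryOn,
		RetriableStatusCodes: p.RetriableStatusCodes,
	}
}

func main() {
	// An empty policy falls back to the existing behavior.
	fmt.Println(toEnvoy(toDAG(v1RetryPolicy{})).RetryOn) // prints 5xx

	p := v1RetryPolicy{
		RetryOn:              []RetryOn{"connect-failure", "reset"},
		RetriableStatusCodes: []uint32{503, 504},
	}
	r := toEnvoy(toDAG(p))
	fmt.Println(r.RetryOn, r.RetriableStatusCodes) // prints connect-failure,reset [503 504]
}
```

Applying the default in the v1-to-dag step keeps the dag-to-Envoy step a plain copy, which matches the division of work described in the lists above.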
As discussed earlier, we could expose a single option for retrying only upstream connection errors. Example:
retryPolicy:
  count: 3
  perTryTimeout: 150ms
  retryOnConnectFailure: true
Arguments for this approach:
Arguments against this approach:
HTTPProxy is too short-sighted and does not provide a path to adding additional conditions in an elegant way
retry_on value for Envoy
Instead of a list of strings for retryOn, we could make every possible value a unique boolean field.
Example:
retryPolicy:
  count: 3
  perTryTimeout: 150ms
  retryOn:
    5xx: false
    connectFailure: true
    gatewayError: false
    ...
Arguments for this approach:
retry_on field with
retry_on field
Arguments against this approach:
retry_on field
Instead of configuring retry policies per HTTPProxy, we could allow configuration of a global retry policy that applies to all proxies.
This would solve the author's issue of mitigating upstream connect errors during rollouts without having to configure all HTTPProxy manifests with the same retry policy.
If implemented, this global retry policy would live in Contour's global configuration file and would likely resemble the structure of the retryPolicy section of the HTTPProxy manifest.
Taking this approach would require consideration around how retry policies defined in HTTPProxy would consolidate with a global retry policy:
RetryOn field could be a good candidate for merging elements.)
The desired behavior here does not seem clear-cut enough to apply to all use cases for retry policies, so it seems more reasonable to forgo global retry policies until a standard or common use case becomes clear.
Unfortunately, this may result in multiple HTTPProxy manifests sharing the same retry policy. However, existing tools such as Helm are well equipped to manage this type of repetition.
With tens or hundreds of HTTPProxy manifests, each with its own retry policy, it's important to consider the impact that many defined retry policies will have at scale.
One concern is the possibility that retry policies might conflict with one another.
This is a non-issue: each HTTPProxy creates unique routes in Envoy, and the retry policies specified therein exist and operate independently of each other, allowing thousands of these policies to coexist without conflict.
Another concern is the impact that many retry policies can have on the performance of Envoy -- can many retries negatively impact Envoy's performance?
Fortunately, Envoy uses circuit breakers which limit the potential for negative performance from an increase in requests.
In Envoy access logs, requests that breach a circuit breaker include the response flag UO.
This response flag also shows up for requests that were the result of a retry policy, which indicates that Envoy applies circuit breakers to retries as well.
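When checking access logs for this behavior, a small helper along the following lines could be used to test whether a log entry carries the UO flag. This is only a sketch: `hasResponseFlag` is a hypothetical function, it operates on the value of the response-flags field alone (extracting that field depends on the configured access log format), and it assumes that multiple flags are comma-separated and that "-" means no flags were set.

```go
package main

import (
	"fmt"
	"strings"
)

// hasResponseFlag reports whether flag appears in the value of an access
// log entry's response-flags field. Assumptions: multiple flags are
// comma-separated, and "-" is logged when no flag is set.
func hasResponseFlag(field, flag string) bool {
	for _, f := range strings.Split(field, ",") {
		if strings.TrimSpace(f) == flag {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasResponseFlag("UO", "UO"))     // prints true
	fmt.Println(hasResponseFlag("UF,URX", "UO")) // prints false
	fmt.Println(hasResponseFlag("-", "UO"))      // prints false
}
```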
The new fields of the retryPolicy -- retryOn and retriableStatusCodes -- are both optional, so existing HTTPProxy manifests will be syntactically compatible.
Furthermore, a nil or empty value for retryOn will result in a default value that is backwards compatible with Contour's existing logic: retrying all 5xx status codes.