website/src/content/posts/2025-06-02-faster-route-propagation-by-rewriting-our-traefik-gateway-in-rust/page.mdx
import imgColdStarts from "./cold-starts.svg";
import imgRivetGuardArch from "./rivet-guard-arch.svg";
import imgRivetStack from "./rivet-guard-stack.svg";
import imgTraefikArch from "./traefik-arch.svg";
Rivet Guard is our gateway service that handles all incoming traffic to the Rivet platform. Its core responsibilities include:
Traefik is a widely used reverse proxy that makes it incredibly easy to set up load balancing & routing in a variety of environments.
To ship the original version of Rivet, we chose Traefik because of its ease of use for configuring dynamic routes. We always knew this was a temporary solution until we could write a more mature gateway ourselves.
We took advantage of the Traefik HTTP provider to dynamically configure routes for the MVP of Rivet. Our setup consisted of 3 components:
Part 1: Slow route propagation
It's incredibly important that Rivet Functions, Actors, and Containers are immediately available when provisioned. We put a lot of work into our infrastructure to make this happen, but Traefik has always been the one piece that slowed down how fast we can make resources available.
Under the hood, we had Traefik configured to poll at a 500 ms interval with `providersThrottleDuration` set to 0.025s; setting the interval any faster caused internal backpressure. Despite this configuration, Traefik does not expose routes within 500 ms in practice. Instead, we found it reliably took between 1 and 2 seconds. We therefore had to add an artificial 2-second timeout just to wait for our routes to propagate reliably within that window.
All of the work we had put into making our backend & infrastructure provision resources blazing fast was going to waste with a measly timeout.
Part 2: Large HTTP provider responses
Every single route available on Traefik required a configuration entry for:
Each route comes out to a minimum of 1.3 KB of JSON — assuming the developer has not configured additional middleware.
<Accordion title="Example HTTP route">A single route looks roughly like this:
{
"http": {
"services": {
"job-run:00000000-0000-0000-0000-000000000000:game_default": {
"loadBalancer": { "servers": [ { "url": "http://10.0.0.1:24534" } ]
}
}
},
"routers": {
"job-run:00000000-0000-0000-0000-000000000000:example-hash-1:https": {
"entryPoints": [ "lb-443" ],
"rule": "Host(`11111111-1111-1111-1111-111111111111-default.lobby.22222222-2222-2222-2222-222222222222.rivet.run`)",
"service": "job-run:00000000-0000-0000-0000-000000000000:game_default",
"middlewares": [ "job-rate-limit", "job-in-flight" ],
"tls": { "domains": [ { "main": "lobby.22222222-2222-2222-2222-222222222222.rivet.run", "sans": [ "*.lobby.22222222-2222-2222-2222-222222222222.rivet.run" ] } ] }
},
"job-run:00000000-0000-0000-0000-000000000000:example-hash-2:https": {
"entryPoints": [ "lb-443" ],
"rule": "(Host(`lobby.22222222-2222-2222-2222-222222222222.rivet.run`) && PathPrefix(`/11111111-1111-1111-1111-111111111111-default`))",
"service": "job-run:00000000-0000-0000-0000-000000000000:game_default",
"middlewares": [ "job-rate-limit", "job-in-flight", "job-run-strip-prefix:00000000-0000-0000-0000-000000000000:example-hash-3" ],
"tls": { "domains": [ { "main": "lobby.22222222-2222-2222-2222-222222222222.rivet.run/11111111-1111-1111-1111-111111111111-default", "sans": [] } ] }
}
},
"middlewares": {
"job-run-strip-prefix:00000000-0000-0000-0000-000000000000:example-hash-3": {
"stripPrefix": { "prefixes": [ "/11111111-1111-1111-1111-111111111111-default" ] }
}
}
}
}
At around 3 MB of JSON per response (just over 100 routes per region, or 800 routes spread across our 8 regions), Traefik's latency started to increase noticeably.
Our load tests had previously shown that Traefik stops working entirely once the HTTP provider JSON config hits ~14 MB (on the order of 10,000 routes at 1.3 KB each), so we knew it was time to upgrade the Rivet gateway before bad things happened.
When designing Rivet Guard, we wanted it to support:
Rivet Guard is built differently than most other proxies: it's a library into which we plug our own custom routing handlers. Routes are lazily resolved & cached on demand instead of the full route table being kept in memory.
When a request reaches Rivet Guard, it calls a user-provided function, `routing_fn`, that lazily returns the endpoint to connect to. This endpoint is then cached for future requests.
In contrast to the previous Traefik configuration, this provides the following benefits:
When configured with a routing_fn that can resolve routes from an external datasource, the new gateway architecture for Rivet looks like:
Configuring Rivet Guard is done by defining three functions:
- `routing_fn`: For a given hostname and path, return the target endpoints to route to
- `middleware_fn`: For a given endpoint, return the middleware configuration (i.e. rate limiting, max in-flight, retries, timeout)
- `cert_resolver_fn`: For a given hostname, return the TLS certificate to use

Configuring Rivet Guard looks like this:
```rust
#[tokio::main]
async fn main() -> Result<()> {
    // Routing
    let routing_fn = |hostname: &str, path: &str, port_type: rivet_guard_core::proxy_service::PortType| async {
        // ...fetch routes here...
        Ok(Some(RoutingOutput::Route(RouteConfig {
            targets: vec![
                RouteTarget { host, port, path },
                // ...more targets...
            ],
            timeout: RoutingTimeout { routing_timeout },
        })))
    };

    // Middleware
    let middleware_fn = |actor_id: &uuid::Uuid| async {
        // ...fetch middleware configuration...
        Ok(MiddlewareResponse::Ok(MiddlewareConfig {
            rate_limit: RateLimitConfig { requests, period },
            max_in_flight: MaxInFlightConfig { amount },
            retry: RetryConfig { max_attempts, initial_interval },
            timeout: TimeoutConfig { request_timeout },
        }))
    };

    // TLS certificates
    let cert = load_certified_key(&CertificatePair {
        name: "My Cert",
        cert_path: Path::new(&tls_config.actor_cert_path).into(),
        key_path: Path::new(&tls_config.actor_key_path).into(),
    })?;
    let cert_resolver_fn = move |hostname: &str| {
        // ...match hostname to the appropriate cert...
        Ok(cert.clone())
    };

    // Start server
    rivet_guard_core::run_server(config, routing_fn, middleware_fn, cert_resolver_fn).await
}
```
In practice, Rivet's actual routing function looks like this:
```rust
let routing_fn = |host: &str, path: &str, port_type: rivet_guard_core::proxy_service::PortType| async {
    match actor_routes::route_via_route_config(&ctx, host, path, dc_id).await {
        Ok(Some(x)) => return Ok(x),
        Ok(None) => { /* Continue to next routing method */ }
        Err(err) => return Ok(error_response(err)),
    }

    match actor::route_actor_request(&ctx, host, path, dc_id).await {
        Ok(Some(x)) => return Ok(x),
        Ok(None) => { /* Continue to next routing method */ }
        Err(err) => return Ok(error_response(err)),
    }

    match api::route_api_request(&ctx, host, path, &dc.name_id, dc_id).await {
        Ok(Some(x)) => return Ok(x),
        Ok(None) => { /* Continue to next routing method */ }
        Err(err) => return Ok(error_response(err)),
    }

    // No matching route found
    Ok(RoutingOutput::Response(StructuredResponse {
        status: StatusCode::NOT_FOUND,
        message: Cow::Owned(format!(
            "No route found for hostname: {host}, path: {path}"
        )),
        docs: None,
    }))
};
```
Rivet Guard is responsible for handling everything from the TCP connection to the actual proxying. Accepting a request involves the following steps:

1. Accept the TCP connection
2. Resolve the TLS certificate with `cert_resolver_fn` and complete the TLS handshake
3. Parse the HTTP request with `hyper`
4. Resolve the target endpoint with `routing_fn`
5. Apply the middleware returned by `middleware_fn`
6. Proxy the request, upgrading WebSocket connections via `tokio-tungstenite`
7. Buffer & retry the request if the target is not yet reachable

Most of this is pretty standard across reverse proxies, except for steps 4 and 7. We'll dive into how these work in practice.
Now for the good part: how does rewriting this critical piece of software make it so much faster?
Instead of waiting 2 seconds for Traefik to poll our configuration, we can immediately look up where to route a request with `routing_fn` — even if the route isn't ready to serve requests yet (more on that later).
Under the hood, almost everything in Rivet is powered by FoundationDB. Route lookups are a single key get operation, which takes between 0.1 and 1 ms. That means our worst-case routing latency is 1 millisecond — 2,000x faster than our previous 2-second latency.
After the first request to a route, all following requests are cached, meaning that the 1 millisecond overhead is a one-off scenario & future requests take a negligible amount of time to resolve. More on this later.
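For illustration, here's a minimal sketch of what a single-key route lookup against FoundationDB can look like with the `foundationdb` crate (assuming the client has already been booted). The key layout, function name, and `RouteTarget` type are hypothetical, not Rivet's actual schema:

```rust
use foundationdb::Database;

// Hypothetical key layout: "route/{actor_id}" -> JSON-serialized target.
// RouteTarget stands in for a deserializable target type.
async fn lookup_route(
    db: &Database,
    actor_id: uuid::Uuid,
) -> anyhow::Result<Option<RouteTarget>> {
    let trx = db.create_trx()?;
    let key = format!("route/{actor_id}");

    // A single point read: typically 0.1-1 ms against FoundationDB
    let value = trx.get(key.as_bytes(), false).await?;

    Ok(value
        .map(|bytes| serde_json::from_slice::<RouteTarget>(&bytes))
        .transpose()?)
}
```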
Most infrastructure like Kubernetes takes a long time to start containers because you have to wait for many steps (download image, start container, wait for server ready, wait for health check).
This means that whenever you need to start a resource, you usually need to wait a long time for it to come up before you have a URL you can access it from.
On the other hand, Rivet returns a URL to access a resource before the resource is ready. This way, the resource has time to start up while the connection URL is being sent back to the client over WAN (which is bottlenecked by the speed of light) and while the caller's backend processes the request.
<Frame caption="Create Container Sequence" className="max-w-3xl mx-auto"> </Frame>This means that instead of waiting for:
Rivet Guard can do:
This sequence assumes that a container URL has to be sent back to the client over WAN, which adds a full round trip. This is a common pattern for container workloads like game servers, desktop automation, and code sandboxes. Even if you don't need to share a container's URL publicly, this still significantly cuts latency, since your backend can process the response in parallel with the container starting.
However, if the resource is not ready by the time the client makes the request, normally a reverse proxy would fail with something like a Bad Gateway error.
Instead, Rivet Guard intelligently buffers requests until the resource is ready, then delivers them.

This has the added benefit of gracefully handling resource crashes & restarts. Say a resource becomes unresponsive or restarts: this mechanism buffers requests and sends them once the service accepts the underlying TCP connection. This is incredibly important for Rivet Actors and Rivet Containers where — unlike traditional edge functions — resources are frequently singletons (only one instance of a given resource) that need to restart gracefully.
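A minimal sketch of the idea (function name and backoff values are illustrative, not Rivet's actual tuning): retry the upstream TCP connect with backoff while the resource is still starting, instead of immediately failing with Bad Gateway:

```rust
use std::time::Duration;
use tokio::net::TcpStream;

// Retry the upstream connect with exponential backoff while the resource
// is still starting, instead of immediately returning Bad Gateway.
async fn connect_with_buffering(addr: &str) -> std::io::Result<TcpStream> {
    let mut delay = Duration::from_millis(50);
    for attempt in 0..10 {
        match TcpStream::connect(addr).await {
            Ok(stream) => return Ok(stream),
            // Connection refused usually means the resource hasn't bound its
            // port yet; back off and retry instead of surfacing an error.
            Err(err) if err.kind() == std::io::ErrorKind::ConnectionRefused && attempt < 9 => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(1));
            }
            Err(err) => return Err(err),
        }
    }
    unreachable!("the final attempt always returns")
}
```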
This pattern is well proven in systems like FoundationDB where instead of doing rolling deploys for upgrades, you simply restart the database with the new version. The clients are able to intelligently replay requests so there is no impact to the client. Read more about FoundationDB's upgrade process to see prior art to this pattern.
The key part of this architecture is that the routing function is called lazily and cached for speedy responses.
The caching is done by returning the host & path prefix from the routing function. For example, if the user requests `example.com/foo/bar`, but we know that the route prefix is `/foo`, we'll cache that route for all future requests to `example.com/foo/*` (including `example.com/foo/baz`).
The value that gets cached is the targets & the middleware (in separate caches with different timeouts).
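A simplified sketch of that cache (types and structure are illustrative; the real implementation also keeps the separate middleware cache with its own timeout, and `RouteTarget` stands in for the resolved target type):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Illustrative cached entry: the resolved targets plus an expiration time.
struct CachedRoute {
    targets: Vec<RouteTarget>,
    expires_at: Instant,
}

// (hostname, path prefix) -> cached route
struct RouteCache {
    entries: HashMap<(String, String), CachedRoute>,
}

impl RouteCache {
    // A request for example.com/foo/bar hits the entry cached under
    // ("example.com", "/foo"), as does any other example.com/foo/* path.
    fn lookup(&self, host: &str, path: &str) -> Option<&CachedRoute> {
        self.entries
            .iter()
            .filter(|((h, prefix), entry)| {
                h == host
                    && path.starts_with(prefix.as_str())
                    && entry.expires_at > Instant::now()
            })
            // Prefer the most specific (longest) matching prefix
            .max_by_key(|((_, prefix), _)| prefix.len())
            .map(|(_, entry)| entry)
    }
}
```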
If we cache routes and the target changes (e.g. resource upgrades or migrates to a different machine), how do we know when we need to update the target list?
We could implement a cache invalidation system by sending a message to all Rivet Guard instances (similar to what we did for Traefik with NATS), but this requires adding another moving part to a system that we're trying to keep as simple as possible. (We'll still implement this later to minimize p99 when restarting resources & pick up new targets faster.)
Instead, we can implement a much simpler alternative: if we can't route to a target, invalidate the cache and try again:
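In code, that control flow looks roughly like this (`lookup_or_resolve`, `invalidate`, `proxy_to`, and `Response` are hypothetical stand-ins for the real routing & proxying plumbing):

```rust
// Illustrative lazy invalidation: on a failed upstream connect, drop the
// cached entry, re-run routing, and retry once with the fresh target.
async fn handle_request(host: &str, path: &str) -> anyhow::Result<Response> {
    // First attempt: use the cached route (resolving & caching on a miss)
    let route = lookup_or_resolve(host, path).await?;
    match proxy_to(&route).await {
        Ok(response) => Ok(response),
        Err(_) => {
            // Connect failed: the target likely moved (upgrade, migration).
            // Invalidate the cached entry, resolve fresh, and retry once.
            invalidate(host, path);
            let fresh = lookup_or_resolve(host, path).await?;
            proxy_to(&fresh).await
        }
    }
}
```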
A keen eye may notice a problem — what if Container B gets scheduled on 1.2.3.4:5678 before our cache can be invalidated by a connection refused to Container A? Wouldn't that incorrectly route requests meant for Container A to Container B? Yep — that's why we reserve the port after it's been released for a duration longer than our cache expiration, so this never happens.
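Expressed as an invariant (the durations below are illustrative, not Rivet's actual values):

```rust
use std::time::Duration;

// Illustrative values: the port reservation must outlive any cached route.
const ROUTE_CACHE_TTL: Duration = Duration::from_secs(60);
const PORT_RESERVATION_TTL: Duration = Duration::from_secs(90);

// Compile-time check: a stale cache entry can never point at a port
// that has already been handed to a different container.
const _: () = assert!(PORT_RESERVATION_TTL.as_secs() > ROUTE_CACHE_TTL.as_secs());
```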
Now for the fun part — provoking the age-old Crustaceans-Gophers Wars.
I want to be clear that this is not your usual "Rewrite It In Rust" post — I firmly believe Go would have been a great choice for this as well.
Either would have been well suited for the job, but these are some factors that made us comfortable choosing Rust instead of Go, despite Go's extensive history with Traefik, Istio, and other networking infrastructure tools:
- Correctness guarantees: `Option` preventing null pointer panics, sum types making invalid states unrepresentable, and concrete error types forcing explicit error handling. I can't overstate the importance of this for something as critical as our gateway, which touches every request that reaches Rivet.
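As a concrete example of the sum-type point, here's a simplified sketch of the `RoutingOutput` enum seen in the snippets above: a routing handler must either produce a concrete route or a terminal response, and there is no null case to forget:

```rust
// Simplified sketch of RoutingOutput: every routing result is one of these
// variants, so "no route yet" can't silently become a null pointer.
enum RoutingOutput {
    // Proxy the request to these resolved targets
    Route(RouteConfig),
    // Terminate the request with a structured response (e.g. 404)
    Response(StructuredResponse),
}
```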
Our use case is not totally unique. Most large companies need complicated dynamic route discovery, so there are mature tools in the ecosystem for this.

Envoy is used heavily in services like Consul for dynamic configuration. Envoy works via a "discovery service" (xDS), as described in its dynamic configuration documentation.
We could've exposed a customized Endpoint Discovery Service (EDS) and Route Discovery Service (RDS) over gRPC or REST to support this.
Why we went with our own custom solution:
`rivet-guard-core` has a lot of similarities to Cloudflare's Pingora project, which supports building programmable network services.

Pingora likely would've met our use case with flying colors. It came down to scope: the MVP for what we need is not that large. `rivet-guard-core` is only 1,777 physical lines of code (excluding tests), and `rivet-guard` (which implements the routing logic) is only 861 physical lines.
Similar to our reason for not using Envoy, our gateway has a lot of edge cases to handle, so building from scratch was a safer time investment (both for the MVP and our future projects) than running into a wall with an external library and having to fork it and try to get changes merged upstream.
Despite Rivet Guard's source being relatively small, I don't want to understate the difficulty of writing a reverse proxy. Reverse proxies are easy to implement, but even easier to implement incorrectly. Most of the time developing Rivet Guard was spent writing test cases to handle edge cases around TLS, WebSockets, retries, caching, common DoS attacks, etc.
There are many features that we're looking forward to supporting now that we have a solid foundation with Rivet Guard.
These are features that we hope to implement in the future that influenced our design:
Building Rivet Guard has unlocked massive performance benefits, given us flexibility for future features, and removed the biggest bottleneck to scaling Rivet.
Key technical improvements include:
Sometimes you just have to build the exact tool you need.
Edit (June 4, 2025): Fixed a typo implying Go lacks "memory safety"; replaced with "correctness guarantees".