docs/rfcs/029-getpage-throttling.md
Author: Christian Schwarz Date: Oct 24, 2023
This RFC proposes per-tenant throttling of GetPage@LSN requests inside Pageserver and the interactions with its client, i.e., the neon_smgr component in Compute.
The result of implementing & executing this RFC will be a fleet-wide upper limit for "the highest GetPage/second that Pageserver can support for a single tenant/shard".
Pageserver exposes its page_service.rs as a libpq listener.
The Computes' neon_smgr module connects to that libpq listener.
Once a connection is established, the protocol allows Compute to request page images at a given LSN.
We call these requests GetPage@LSN requests, or GetPage requests for short.
Other request types can be sent, but these are low traffic compared to GetPage requests
and are not the concern of this RFC.
Pageserver associates one libpq connection with one tokio task.
Per connection/task, the pq protocol is handled by the common postgres_backend crate.
Its run_message_loop function invokes the page_service specific impl<IO> postgres_backend::Handler<IO> for PageServerHandler.
Requests are processed in the order in which they arrive via the TCP-based pq protocol.
So, there is no concurrent request processing within one connection/task.
There is a degree of natural pipelining: Compute can "fill the pipe" by sending more than one GetPage request into the libpq TCP stream. And Pageserver can fill the pipe with responses in the other direction. Both directions are subject to the limit of tx/rx buffers, nodelay, TCP flow control, etc.
The Compute has its own hierarchy of caches, specifically shared_buffers and the local file cache (LFC).
Compute only issues GetPage requests to Pageserver if it encounters a miss in these caches.
If the working set stops fitting into Compute's caches, requests to Pageserver increase sharply -- the Compute starts thrashing.
In INC-69, a tenant issued 155k GetPage/second for a period of 10 minutes and 60k GetPage/second for a period of 3h, then dropping to ca 18k GetPage/second for a period of 9h.
We noticed this because of an internal GetPage latency SLO burn rate alert, i.e., the request latency profile during this period significantly exceeded what was acceptable according to the internal SLO.
Sadly, we do not have the observability data to determine the impact of this tenant on other tenants on the same tenants.
However, here are some illustrative data points for the 155k period:
The tenant was responsible for >= 99% of the GetPage traffic and, frankly, the overall activity on this Pageserver instance.
We were serving pages at 10 Gb/s (155k x 8 kbyte (PAGE_SZ) per second is 1.12GiB/s = 9.4Gb/s.)
The CPU utilization of the instance was 75% user+system.
Pageserver page cache served 1.75M accesses/second at a hit rate of ca 90%.
The hit rate for materialized pages was ca. 40%.
Curiously, IOPS to the Instance Store NVMe were very low, rarely exceeding 100.
The fact that the IOPS were so low / the materialized page cache hit rate was so high suggests that this tenant's compute's caches were thrashing.
The compute was of type k8s-pod; hence, auto-scaling could/would not have helped remediate the thrashing by provisioning more RAM.
The consequence was that the thrashing translated into excessive GetPage requests against Pageserver.
My claim is that it was unhealthy to serve this workload at the pace we did:
predictability of performance: when the working set grows, the pageserver materialized page cache hit rate drops. At some point, we're bound by the EC2 Instance Store NVMe drive's IOPS limit. The result is an uneven performance profile from the Compute perspective.
economics: Neon currently does not charge for IOPS, only capacity. We cannot afford to undercut the market in IOPS/$ this drastically; it leads to adverse selection and perverse incentives. For example, the 155k IOPS, which we served for 10min, would cost ca. 6.5k$/month when provisioned as an io2 EBS volume. Even the 18k IOPS, which we served for 9h, would cost ca. 1.1k$/month when provisioned as an io2 EBS volume. We charge 0$. It could be economically advantageous to keep using a low-DRAM compute because Pageserver IOPS are fast enough and free.
Note: It is helpful to think of Pageserver as a disk, because it's precisely where neon_smgr sits:
vanilla Postgres gets its pages from disk, Neon Postgres gets them from Pageserver.
So, regarding the above performance & economic arguments, it is fair to say that we currently provide an "as-fast-as-possible-IOPS" disk that we charge for only by capacity.
The consequence of the above analysis must be that Pageserver throttles GetPage@LSN requests. That is, unless we want to start charging for provisioned GetPage@LSN/second. Throttling sets the correct incentive for a thrashing Compute to scale up its DRAM to the working set size. Neon Autoscaling will make this easy, eventually.
What that remains is the question about policy and mechanism:
Policy concerns itself with the question of what limit applies to a given connection|timeline|tenant. Candidates are:
Regarding mechanism, it's clear that backpressure is the way to go. However, we must choose between
Further, there is the question of how throttling GetPage@LSN will affect the internal GetPage latency SLO: where do we measure the SLI for Pageserver's internal getpage latency SLO? Before or after the throttling?
And when we eventually move the measurement point into the Computes (to avoid coordinated omission), how do we avoid counting throttling-induced latency toward the internal getpage latency SLI/SLO?
This RFC proposes introducing a hard GetPage@LSN/second limit per tenant, with the same value applying to each tenant on a Pageserver.
This proposal is easy to implement and significantly de-risks operating large Pageservers, based on the assumption that extremely-high-GetPage-rate-episodes like the one from the "Motivation" section are uncorrelated between tenants.
For example, suppose we pick a limit that allows up to 10 tenants to go at limit rate. Suppose our Pageserver can serve 100k GetPage/second total at a 100% page cache miss rate. If each tenant gets a hard limit of 10k GetPage/second, we can serve up to 10 tenants at limit speed without latency degradation.
The mechanism for backpressure will be TCP-based implicit backpressure. The compute team isn't concerned about prefetch queue depth. Pageserver will implement it by delaying the reading of requests from the libpq connection(s).
The rate limit will be implemented using a per-tenant token bucket.
The bucket will be be shared among all connections to the tenant.
The bucket implementation supports starvation-preventing awaiting.
The current candidate for the implementation is leaky_bucket.
The getpage@lsn benchmark that's being added in https://github.com/neondatabase/neon/issues/5771
can be used to evaluate the overhead of sharing the bucket among connections of a tenant.
A possible technique to mitigate the impact of sharing the bucket would be to maintain a buffer of a few tokens per connection handler.
Regarding metrics / the internal GetPage latency SLO: we will measure the GetPage latency SLO after the throttler and introduce a new metric to measure the amount of throttling, quantified by:
Further observability measures:
Rollout will happen as follows:
The per-tenant override will remain for emergencies and testing. But since Console doesn't preserve it during tenant migrations, it isn't durably configurable for the tenant.
Toward the upper layers of the Neon stack, the resulting limit will be "the highest GetPage/second that Pageserver can support for a single tenant".
We decided against error + retry because of worries about starvation.
Enable per-tenant emergency override of the limit via Console. Should be part of a more general framework to specify tenant config overrides. NB: this is not the right mechanism to sell different max GetPage/second levels to users, or auto-scale the GetPage/second levels. Such functionality will require a separate RFC that concerns itself with GetPage/second capacity planning.
Compute-side metrics for GetPage latency.
Back-channel to inform Compute/Autoscaling/ControlPlane that the project is being throttled.
Compute-side neon_smgr improvements to avoid sending the same GetPage request multiple times if multiple backends experience a cache miss.
Dealing with read-only endpoints: users use read-only endpoints to scale reads for a single tenant. Possibly there are also assumptions around read-only endpoints not affecting the primary read-write endpoint's performance. With per-tenant rate limiting, we will not meet that expectation. However, we can currently only scale per tenant. Soon, we will have sharding (#5505), which will apply the throttling on a per-shard basis. But, that's orthogonal to scaling reads: if many endpoints hit one shard, they share the same throttling limit. To solve this properly, I think we'll need replicas for tenants / shard. To performance-isolate a tenant's endpoints from each other, we'd then route them to different replicas.