website/blog/2022-10-24-buffer-pooling.md
In an earlier blog post about networking security, we described how and why gVisor implements its own userspace network stack in the Sentry (gVisor kernel). In summary, we’ve implemented our networking stack – aka Netstack – in Go to minimize exposure to unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack, gVisor can do all packet processing internally and only has to enable a few host I/O syscalls for near-complete networking capabilities. This keeps gVisor’s exposure to host vulnerabilities as narrow as possible.
<!--/excerpt-->Although writing Netstack in Go was important for runtime safety, up until now it had an undeniable performance cost. iperf benchmarks showed Netstack was spending between 20-30% of its processing time allocating memory and pausing for garbage collection, a slowdown that limited gVisor’s ability to efficiently sandbox networking workloads. In this blog we will show how we crafted a cure for Netstack’s allocation addiction, reducing them by 99%, while also increasing gVisor networking throughput by 30+%.
{:width="100%"}
Go guarantees a basic level of memory safety through the use of a garbage collector (GC), which is described in great detail by the Go team here. The Go runtime automatically tracks and frees objects allocated from the heap, relieving the programmer of the often painful and error-prone process of manual memory management. Unfortunately, tracking and freeing memory during runtime comes at a performance cost. Running the GC adds scheduling overhead, consumes valuable CPU time, and occasionally pauses the entire program’s progress to track down garbage.
Go’s GC is highly optimized, tunable, and sufficient for a majority of workloads. Most of the other parts of gVisor happily use Go's GC with no complaints. However, under high network stress, Netstack needed to aggressively allocate buffers used for processing TCP/IP data and metadata. These buffers often had short lifespans, and once the processing was done they were left to be cleaned up by the GC. This meant Netstack was producing tons of garbage that needed to be tracked and freed by GC workers.
Luckily, we weren't the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced sync.Pool in Go1.3. sync.Pool is
designed to relieve pressure off the Go GC by maintaining a thread-safe cache of
previously allocated objects. sync.Pool can retrieve an object from the cache
if it exists or allocate a new one according to a user specified allocation
function. Once the user is finished with an object they can safely return it to
the cache to be reused again.
While sync.Pool was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our make()s with pool.Get()s.
Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
segment
for TCP, and others are more widely shared, like
PacketBuffer,
which is used for IP, ICMP, UDP, etc. Although each of these buffer types are
slightly different, they generally share a few common traits that made it
difficult to use sync.Pool out of the box:
sync.Pool out of the
box can only accommodate buffers of a single size. One common solution to
this is to fill a pool with
bytes.Buffer, but even a pooled
bytes.Buffer could incur allocations if it were too small and had to be
grown to the requested size.We needed an efficient, low-level buffer abstraction that had answers for the Netstack specific challenges and could be shared by the various intermediate buffer types. By sharing a common buffer abstraction, we could maximize the benefits of pooling and avoid introducing additional allocations while minimally changing any intermediate buffer processing logic.
Our solution was bufferv2. Bufferv2 is a non-contiguous, reference counted, pooled, copy-on-write, buffer-like data structure.
Internally, a bufferv2 Buffer is a linked list of Views. Each View has
start/end indices and holds a pointer to a Chunk. A Chunk is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several Chunk pools, each of which allocates chunks with
different sized byte slices. These sizes start at 64 and double until 64k.
{:width="100%"}
The design of bufferv2 has a few key advantages over simpler object pooling:
Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (PacketBuffer, segment, etc) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.
In addition to the upfront cost, the shift to bufferv2 also increased the engineering complexity of future Netstack changes. Netstack contributors must adhere to new rules to maintain memory safety and maximize the benefits of pooling. These rules are strict – there needs to be strong justification to break them. They are as follows:
NewView() instead.View for simple data operations (e.g writing some data of a fixed
size) and a Buffer for more complex I/O operations (e.g appending data of
variable size, merging data, writing from an io.Reader).View as a byte slice, use
View.AsSlice(). If you need access to the contents of a Buffer as a byte
slice, consider refactoring, as this will cause an allocation.View.AsSlice(); they are
still owned by the view.Bufferv2 is enabled by default as of gVisor 20221017, and will be rolling out to GKE Sandbox soon, so no action is required to see a performance boost. Network-bound workloads, such as web servers or databases like Redis, are the most likely to see benefits. All the code implementing bufferv2 is public here, and contributions are welcome! If you’d like to run the iperf benchmark for yourself, you can run:
make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
in the base gVisor directory. If you experience any issues, please feel free to let us know at gvisor.dev/issues.