Optimize latency and throughput

docs/deployment/optimize-latency-and-throughput.mdx

2026.4.11.9 KB

The TensorZero Gateway is designed from the ground up with performance in mind. Even with default settings, it is fast and lightweight enough that its overhead goes unnoticed in most applications. The best practices below help you tune the gateway for production deployments that require maximum performance.

<Tip>

The TensorZero Gateway can achieve <1ms P99 latency overhead at 10,000+ QPS. See Benchmarks for details.

</Tip>

Best practices

Observability data collection strategy

By default, the gateway uses async_writes to write observability data asynchronously: it returns the response to the client immediately, without waiting for database writes to complete. Each database insert is dispatched immediately in its own background task.

For high-throughput applications, you can use gateway.observability.batch_writes instead, which collects multiple records and writes them to the database together in batches. This is more efficient than inserting each record individually.

If you need strict data durability guarantees (ensuring data is persisted in the database before sending a response), you can disable async writes by setting gateway.observability.async_writes = false.

As a rule of thumb, consider the following decision matrix:

| | High throughput | Low throughput |
| --- | --- | --- |
| Latency is critical | batch_writes | async_writes (default) |
| Latency is not critical | batch_writes | Synchronous writes |

See the Configuration Reference for more details.
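
The decision matrix above can also be read as a small lookup. The following is only an illustrative sketch: the function name `recommend_write_strategy` and the returned labels are hypothetical and are not part of TensorZero. It simply encodes the four cells of the matrix.

```python
def recommend_write_strategy(high_throughput: bool, latency_critical: bool) -> str:
    """Return the write strategy suggested by the rule-of-thumb matrix."""
    if high_throughput:
        # Both high-throughput cells of the matrix suggest batching,
        # whether or not latency is critical.
        return "batch_writes"
    if latency_critical:
        # Low throughput, latency-sensitive: keep the default async writes.
        return "async_writes"
    # Low throughput, latency not critical: synchronous writes
    # (that is, gateway.observability.async_writes = false).
    return "synchronous writes"
```

For example, `recommend_write_strategy(high_throughput=False, latency_critical=True)` returns `"async_writes"`, the default mode.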

Other recommendations

  • Ensure your application, the TensorZero Gateway, and database are deployed in the same region to minimize network latency.
  • Initialize the client once and reuse it as much as possible, to avoid initialization overhead and to keep the connection alive.
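
One common way to follow the client-reuse recommendation is to build the client lazily and cache it for the lifetime of the process. The sketch below uses a hypothetical stand-in class, `GatewayClient`, and a placeholder URL. It does not show the TensorZero client API; substitute whatever client your application actually uses.

```python
from functools import lru_cache


class GatewayClient:
    """Hypothetical stand-in for a real gateway client (not the TensorZero API)."""

    instances_created = 0

    def __init__(self, gateway_url: str) -> None:
        # A real client would typically set up its connection pool here,
        # which is the initialization cost worth paying only once.
        self.gateway_url = gateway_url
        GatewayClient.instances_created += 1


@lru_cache(maxsize=1)
def get_client() -> GatewayClient:
    """Build the client on first use, then return the same instance."""
    return GatewayClient("http://localhost:3000")  # placeholder URL


def handle_request(payload: str) -> str:
    client = get_client()  # reused across calls, never rebuilt per request
    return f"{client.gateway_url} <- {payload}"
```

However many times `handle_request` runs, the constructor executes only once, so the same client instance (and whatever connections it holds) is shared by all requests in the process.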