README.md
The goal of this project is to collect software, numbers, and techniques to quickly estimate the expected performance of systems from first principles. For example, how quickly can you read 1 GB of memory? By composing these resources you should be able to answer interesting questions like: how much should you expect to pay for log storage for an application serving 100,000 RPS?
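As a worked example of that last question (a sketch, not a measurement: the ~1 KiB of logs per request is an assumption, and the ~$0.5/GB price comes from the cost table further down):

```rust
// Napkin sketch: monthly log-storage cost at 100,000 RPS.
// Assumptions (illustrative): ~1 KiB of logs per request, and the
// ~$0.5/GB Logs/Traces price from the cost table below.
fn main() {
    let rps = 1e5;
    let bytes_per_request = 1024.0; // assumed ~1 KiB of logs per request
    let seconds_per_month = 2.6e6; // ~30 days
    let dollars_per_gb = 0.5; // Logs/Traces row in the cost table

    let gb_per_month = rps * bytes_per_request * seconds_per_month / 1e9;
    let cost = gb_per_month * dollars_per_gb;
    println!("~{gb_per_month:.0} GB/month of logs => ~${cost:.0}/month");
    // Roughly: ~266240 GB/month => ~$133120/month, i.e. on the order of
    // $10^5/month before compression and retention policies knock it down.
}
```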
The best introduction to this skill is through my talk at SRECON.
The best way to practise napkin math in the grand domain of computers is to work on your own problems. The second-best is to subscribe to this newsletter, where you'll get a problem to practise on every few weeks. Each one should only take a few minutes to solve as your facility with these techniques improves.
The archive of problems to practise on is here. The solution appears in the following newsletter.
Below are numbers rounded for memorization, not faux precision.
The rows this repo can currently refresh on a single host were re-measured and
revalidated on fresh GCP c4-standard-48-lssd instances on March 8, 2026
(Intel Xeon 6985P-C, 48 vCPU / 24 physical cores, 180 GB RAM, Ubuntu 22.04.5
LTS).
Note 1: Some throughput and latency numbers don't line up; this is intentional, for ease of calculation.
Note 2: Take the numbers with a grain of salt. E.g. for I/O, fio is
the state of the art. I am continuously updating these numbers to improve
accuracy as I learn more and as hardware improves.
| Operation | Latency | Throughput | 1 MiB | 1 GiB |
|---|---|---|---|---|
| Sequential Memory R/W (64 bytes) | 0.5 ns | | | |
| ├ Single Thread | | 20 GiB/s | 50 μs | 50 ms |
| ├ Threaded | | 200 GiB/s | 5 μs | 5 ms |
| Network Same-Zone | | 10 GiB/s | 100 μs | 100 ms |
| ├ Inside VPC | | 10 GiB/s | 100 μs | 100 ms |
| ├ Outside VPC | | 3 GiB/s | 300 μs | 300 ms |
| Hashing, not crypto-safe (64 bytes) | 10 ns | 5 GiB/s | 200 μs | 200 ms |
| Random Memory R/W (64 bytes) | 20 ns | 3 GiB/s | 300 μs | 300 ms |
| Fast Serialization [8] [9] † | N/A | 1 GiB/s | 1 ms | 1 s |
| Fast Deserialization [8] [9] † | N/A | 1 GiB/s | 1 ms | 1 s |
| System Call | 300 ns | N/A | N/A | N/A |
| Hashing, crypto-safe (64 bytes) | 100 ns | 1 GiB/s | 1 ms | 1 s |
| Sequential SSD Read (8 KiB) | 1 μs | 8 GiB/s | 100 μs | 100 ms |
| Context Switch [1] [2] | 10 μs | N/A | N/A | N/A |
| Sequential SSD Write, -fsync (8 KiB) | 2 μs | 3 GiB/s | 300 μs | 300 ms |
| TCP Echo Server (32 KiB) | 50 μs | 500 MiB/s | 2 ms | 2 s |
| Decompression [11] | N/A | 1 GiB/s | 1 ms | 1 s |
| Compression [11] | N/A | 500 MiB/s | 2 ms | 2 s |
| Sequential SSD Write, +fsync (8 KiB) | 300 μs | 30 MiB/s | 30 ms | 30 s |
| Sorting (64-bit integers) | N/A | 500 MiB/s | 2 ms | 2 s |
| Sequential HDD Read (8 KiB) | 10 ms | 250 MiB/s | 2 ms | 2 s |
| Blob Storage GET, if not match | 30 ms | | | |
| Blob Storage GET, 1 conn (128 KiB) | 50 ms | 100 MiB/s | 10 ms | 10 s |
| Blob Storage GET, n conn (offsets) | 50 ms | NW limit | | |
| Blob Storage PUT, 1 conn (128 KiB) | 100 ms | 100 MiB/s | 10 ms | 10 s |
| Blob Storage PUT, n conn (multipart) | 150 ms | NW limit | | |
| Blob Storage PUT, CAS (8 KiB) | 100 ms | | | |
| Random SSD Read (8 KiB) | 100 μs | 70 MiB/s | 15 ms | 15 s |
| Serialization [8] [9] † | N/A | 100 MiB/s | 10 ms | 10 s |
| Deserialization [8] [9] † | N/A | 100 MiB/s | 10 ms | 10 s |
| Proxy: Envoy/ProxySQL/Nginx/HAProxy | 50 μs | ? | ? | ? |
| Network within same region | 250 μs | 2 GiB/s | 500 μs | 500 ms |
| Premium network within zone/VPC | 250 μs | 25 GiB/s | 50 μs | 40 ms |
| {MySQL, Memcached, Redis, ..} Query | 500 μs | ? | ? | ? |
| Random HDD Read (8 KiB) | 10 ms | 0.7 MiB/s | 2 s | 30 m |
| Network between regions [6] | Varies | 25 MiB/s | 40 ms | 40 s |
| Network NA Central <-> East | 25 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA Central <-> West | 40 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA East <-> West | 60 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> NA East | 80 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> NA Central | 100 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA West <-> Singapore | 180 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> Singapore | 160 ms | 25 MiB/s | 40 ms | 40 s |
†: "Fast serialization/deserialization" is typically a simple wire-protocol that just dumps bytes, or a very efficient environment. Typically standard serialization such as e.g. JSON will be of the slower kind. We include both here as serialization/deserialization is a very, very broad topic with extremely different performance characteristics depending on data and implementation.
For the active Criterion suite, run `./run --bench napkin_math` to get the
right optimization levels and Linux tuning. You won't get the right numbers
when you're compiling in debug mode. The wrapper already uses `sudo`
internally. On locked-down cloud images, run
`sudo sysctl -w kernel.perf_event_paranoid=-1` once before invoking it. You
can help this project by adding new suites and filling in the blanks.
Note: The active benchmark path today is Criterion.rs in `benches/`.
`src/main.rs` is still the older ad hoc harness and remains the source of truth
for the benches that have not been fully migrated and revalidated yet. The
current Criterion suite includes `memory_read`, `memory_random`, `hash`,
`syscall`, `sort`, `serialization`, `compression`, and
`compressed_memory_read`. The current SSD rows were refreshed from the older
harness with `NAPKIN_BENCH_FILE` pointed at a RAID0 local-SSD mount.
The `compressed_memory_read` Criterion bench is a BitPacker integer-unpack
microbenchmark; it should not be used to rewrite the generic [11]
compression/decompression rows above. The new `serialization` and
`compression` Criterion groups are workload-specific and are not yet wired into
the generic README rows above.
`memory_read` now emits explicit "No SIMD" and "SIMD" variants in Criterion,
but the README intentionally collapses them into one single-thread row and one
threaded row for memorability.
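For contributors, here is a minimal sketch of what a Criterion group in `benches/` looks like. The group name, buffer size, and kernel are illustrative, not the repo's exact code:

```rust
// Minimal Criterion benchmark sketch (illustrative; not the exact code in
// benches/). Measures single-threaded sequential read throughput by summing
// a buffer; Throughput::Bytes makes Criterion report bytes/s.
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn memory_read(c: &mut Criterion) {
    let buf = vec![1u8; 256 * 1024 * 1024]; // 256 MiB, larger than any cache
    let mut group = c.benchmark_group("memory_read");
    group.throughput(Throughput::Bytes(buf.len() as u64));
    group.bench_function("single_thread_sum", |b| {
        b.iter(|| {
            // Touch every byte; black_box keeps the sum from being optimized away.
            black_box(buf.iter().map(|&x| x as u64).sum::<u64>())
        })
    });
    group.finish();
}

criterion_group!(benches, memory_read);
criterion_main!(benches);
```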
I am aware of some inefficiencies in this suite. I intend to improve my skills in this area to ensure the numbers are the upper bound of performance you may be able to squeeze out in production. I find it highly unlikely that any of them are more than 2-3x off, which shouldn't be a problem for most users.
Approximate numbers that should be roughly consistent across cloud providers.
| What | Amount | $/month | 1y commit $/month | Spot $/month | Spot $/hour |
|---|---|---|---|---|---|
| CPU | 1 | $15 | $10 | $2 | $0.005 |
| GPU | 1 | $5000 | $3000 | $1500 | $2 |
| Memory | 1 GB | $2 | $1 | $0.2 | $0.0005 |
| Storage | | | | | |
| ├ Warehouse Storage | 1 GB | $0.02 | | | |
| ├ Blob (S3, GCS) | 1 GB | $0.02 | | | |
| ├ Zonal HDD | 1 GB | $0.05 | | | |
| ├ Ephemeral SSD | 1 GB | $0.08 | $0.05 | $0.05 | $0.07 |
| ├ Regional HDD | 1 GB | $0.1 | | | |
| ├ Zonal SSD | 1 GB | $0.2 | | | |
| ├ Regional SSD | 1 GB | $0.35 | | | |
| Networking | | | | | |
| ├ Same Zone | 1 GB | $0 | | | |
| ├ Blob | 1 GB | $0 | | | |
| ├ Ingress | 1 GB | $0 | | | |
| ├ L4 LB | 1 GB | $0.008 | | | |
| ├ Inter-Zone | 1 GB | $0.01 | | | |
| ├ Inter-Region | 1 GB | $0.02 | | | |
| ├ Internet Egress † | 1 GB | $0.1 | | | |
| CDN Egress | 1 GB | $0.05 | | | |
| CDN Fill ‡ | 1 GB | $0.01 | | | |
| Warehouse Query | 1 GB | $0.005 | | | |
| Logs/Traces ♣ | 1 GB | $0.5 | | | |
| Metrics | 1000 | $20 | | | |
| EKM Keys | 1 | $1 | | | |
† This refers to network traffic leaving your cloud provider, e.g. sending data to S3 from GCP, or egress for serving HTML from AWS to a client.
‡ An additional per-cache-fill fee is incurred, which costs close to blob storage write costs (see just below).
♣ This is standard pricing among a few logging providers, but e.g. Datadog's pricing is different: it charges $0.1 per GB of ingested logs, with $1.5 per 1M events on top for 7-day retention.
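As a quick application of the table (the instance size is illustrative), a 16 vCPU / 64 GB machine prices out like this:

```rust
// Napkin sketch: monthly cost of a 16 vCPU / 64 GB instance, composed from
// the per-unit prices in the table above (instance size is illustrative).
fn main() {
    let (cpus, gb_mem) = (16.0, 64.0);
    let on_demand = cpus * 15.0 + gb_mem * 2.0; // $15/CPU + $2/GB
    let committed = cpus * 10.0 + gb_mem * 1.0; // 1y commit prices
    let spot = cpus * 2.0 + gb_mem * 0.2; // spot prices
    println!("on-demand ~${on_demand:.0}/mo, 1y commit ~${committed:.0}/mo, spot ~${spot:.0}/mo");
    // on-demand ~$368/mo, 1y commit ~$224/mo, spot ~$45/mo
}
```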
Furthermore, for blob storage (S3/GCS/R2/...), you're charged per read/write operation, so fewer, larger files are cheaper:
| Operation | Per 1M | Per 1,000 |
|---|---|---|
| Reads | $0.4 | $0.0004 |
| Writes | $5 | $0.005 |
| EKM Encryption | $3 | $0.003 |
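Because the per-operation fee dominates for small objects, batching pays off quickly. A sketch (object sizes are illustrative) of the PUT cost of writing 1 TB as small vs. large objects:

```rust
// Napkin sketch: per-operation PUT cost of writing 1 TB to blob storage,
// small objects vs. large objects ($5 per 1M writes, from the table above).
fn main() {
    let tb = 1e12; // 1 TB in bytes
    let put_cost_per_million = 5.0;

    let cost = |object_bytes: f64| {
        let ops = tb / object_bytes;
        ops / 1e6 * put_cost_per_million
    };

    println!("8 KiB objects: ~${:.0}", cost(8.0 * 1024.0)); // ~$610 in PUTs
    println!("128 MiB objects: ~${:.2}", cost(128.0 * 1024.0 * 1024.0)); // ~$0.04
    // Storing the terabyte itself is only ~$20/month at $0.02/GB, so for
    // small objects the operations, not the bytes, dominate the bill.
}
```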
These numbers are drawn from a few sources. [3] [4] [5] Note that compression speeds (but
generally not ratios) vary by an order of magnitude depending on the algorithm
and the compression level (which trades speed for compression ratio).
I typically ballpark that each additional 1x of compression ratio decreases performance by 10x. E.g. we can get a 2x ratio on English Wikipedia at ~200 MiB/s, 3x at ~20 MiB/s, and 4x at ~1 MB/s.
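One place this tradeoff shows up is deciding whether to compress before a cross-region transfer. A sketch composing the ~500 MiB/s compression and ~25 MiB/s inter-region rows above, assuming a 2x ratio (data-dependent; see the table below):

```rust
// Napkin sketch: is it worth compressing 1 GiB before sending it between
// regions? Uses ~500 MiB/s compression and ~25 MiB/s inter-region network
// from the tables above, and assumes a 2x compression ratio.
fn main() {
    let size_mib = 1024.0; // 1 GiB = 1024 MiB
    let (compress_mibs, network_mibs, ratio) = (500.0, 25.0, 2.0);

    let raw_secs = size_mib / network_mibs;
    let compressed_secs = size_mib / compress_mibs + (size_mib / ratio) / network_mibs;
    println!("raw: ~{raw_secs:.0} s, compress-then-send: ~{compressed_secs:.0} s");
    // raw: ~41 s, compress-then-send: ~23 s -- and half the egress bill.
}
```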
| What | Compression Ratio |
|---|---|
| HTML | 2-3x |
| English | 2-4x |
| Source Code | 2-4x |
| Executables | 2-3x |
| RPC | 5-10x |
| SSL | -2% [10] |
Write every number in scientific notation, `c * 10^e`. Your goal is to
get within an order of magnitude right--that's just `e`. `c` matters a lot
less. Only worrying about single-digit coefficients and exponents makes it
much easier on a napkin (not to speak of all the zeros you avoid writing). For example, 100,000 RPS is 10^5 RPS, a month is ~2.6 * 10^6 seconds, so a month of requests is ~2.6 * 10^11.

Use `toplev` to find the bottlenecks. This is particularly
useful for the benchmarking suite we have here, to ensure the programs are
correctly written (I have not taken them through this yet, but plan to).

[1]: https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[2]: https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
[3]: https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf
[4]: https://github.com/google/snappy
[5]: https://quixdb.github.io/squash-benchmark/
[6]: https://dl.acm.org/doi/10.1145/1879141.1879143
[7]: https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics#Seek_times_&_characteristics
[8]: https://github.com/simdjson/simdjson#performance-results
[9]: https://github.com/protocolbuffers/protobuf/blob/d20e9a92/docs/performance.md
[10]: https://www.imperialviolet.org/2010/06/25/overclocking-ssl.html
[11]: https://github.com/inikep/lzbench