docs/PerformanceIterationLog.md
Running notes for the multi-iteration performance work on the UDP relay data path. Pick this up to continue without re-deriving everything.
The harness, baseline command, and droplet topology are documented in CLAUDE.md under "Load Test on DigitalOcean" — this file captures the deltas: what was measured, what landed, what didn't, and where the next round should go.
Five commits on claude/beautiful-black-c3b741 between 727ec2ab
("loadgen") and 321a2d18:
| # | Commit | Optimization |
|---|---|---|
| 1 | ce7e7e53 | Hoist turn_server_get_engine() out of per-packet hot path |
| 2 | 8e28491a | ioa_socket_check_bandwidth early fast-exit; drop dead if (!(s->done || s->fd==-1)) in send_data_from_ioa_socket_nbh |
| 3 | 344360f6 | Cache get_relay_socket_ss() and ioa_network_buffer_get_size() in write_to_peerchannel, handle_turn_send, read_client_connection |
| 4 | a6f6767f | Inline get_ioa_addr_len() via ns_turn_ioaddr.h |
| 5 | 321a2d18 | Inline addr_cpy() via ns_turn_ioaddr.h |
Current relay-recvmmsg follow-up:
| # | Commit | Optimization |
|---|---|---|
| 6 | 54c589d0 / 4b1a8d71 | Initial Linux recvmmsg batching for UDP listener and connected relay sockets |
| 7 | 8d9a7292 | Share the existing --udp-recvmmsg flag across listener and relay UDP paths; remove separate relay flag; use the shared ancillary-data parser in dtls_listener |
| 8 | d48686b7 | Reduce relay per-socket recvmmsg state from 16 x 64 KiB cmsg buffers to TTL/TOS-sized buffers, avoid an extra would-block fallback recvmsg, and clean up all preallocated buffers after partial batches |
| 9 | ad81705e | Add per-engine recvmmsg occupancy counters and 10 s log summaries (calls, packets, avg_batch, wouldblock, unavailable, no_buffer, batch-size histogram) |
| 10 | 388b15d4 | Move connected relay UDP recvmmsg scratch from per-socket state to per-engine/per-thread state |
| 11 | 4c4fd67e | Make the occupancy summaries opt-in behind --udp-recvmmsg-log, so --udp-recvmmsg can ship without periodic stats logs |
Validation after #7-#11:
cmake -S . -B build -DBUILD_TESTING=ON passed.cmake --build build --parallel 8 passed.ctest --test-dir build --output-on-failure passed 3/3.build/bin/turnserver --udp-recvmmsg --udp-recvmmsg-log --version
parsed both flags and printed 4.11.0.turnserver build passed after #7, after #8, and after #10.Shipping cleanup learning: keep the occupancy counters in place because they
are low overhead and useful for DigitalOcean diagnostics, but keep the periodic
summaries off by default. Use --udp-recvmmsg-log only during measured runs
where the log stream is part of the observation.
DigitalOcean check on 2026-05-09:
c-4 droplets in nyc1: turnserver public
157.230.3.102, private 10.116.0.2; loadgen public 167.99.153.216,
private 10.116.0.3. Droplets were left running between steps.d48686b7 on both droplets under
/root/coturn_recvmmsg_current.--udp-recvmmsg off/on, -Y packet -m 1 -l 120, 5 alternating
30 s rounds each:
--udp-recvmmsg off/on, -Y packet -m 100 -l 120, 5 alternating
rounds each. The client completed before the 30 s timeout and landed in two
send-volume buckets, so treat this as a coarse many-connection signal:
m=100 -n 1000 run, 3 alternating rounds each, derived receive
count from tot_recv_bytes / 120 because this log format omits
tot_recv_msgs:
Learning: the corrected relay recvmmsg implementation is now buildable and
much safer for many connections, but these droplet runs still do not show a
clear throughput win. Keep --udp-recvmmsg opt-in for now. The next useful
step is to instrument actual batch occupancy on connected relay sockets; if
most readiness events return one datagram, recvmmsg will mostly add setup
work without reducing syscalls.
DigitalOcean occupancy check on 2026-05-09:
388b15d4 on both droplets under
/root/coturn_recvmmsg_current.--udp-recvmmsg off/on, -Y packet -m 1 -l 120, 3 alternating
30 s rounds each:
m=1 occupancy from the on runs: 1,129,427 recvmmsg calls returned
17,660,300 packets, average batch 15.64. Histogram buckets:
hist_1=1,353, hist_2=1,496, hist_3_4=3,707,
hist_5_8=14,817, hist_9_16=1,108,057; 98.1 % of calls were in the
9..16 bucket.--udp-recvmmsg off/on, -Y packet -m 100 -l 120, 3 alternating
runs each:
m=100 occupancy from the on runs across all relay threads: 1,426,401
recvmmsg calls returned 16,188,946 packets, average batch 11.35.
Histogram buckets: hist_1=83,057, hist_2=79,781,
hist_3_4=130,066, hist_5_8=188,259, hist_9_16=945,238; 66.3 %
of calls were in the 9..16 bucket.Learning: receive-side occupancy is high. The earlier hypothesis that
recvmmsg was mostly returning one packet is wrong for this harness. The
remaining bottleneck is after receive: per-packet callbacks, TURN processing,
and especially one sendto per relayed packet. The per-thread scratch change
is still worth keeping for memory/cache behavior with thousands of sockets,
but the next performance lever should be send-side batching or a design that
passes batches deeper instead of immediately decomposing them back into
single-packet callbacks.
Alternating A/B run on the same droplet pair, m=1 packet flood, 30 s per run, with a 4 s warm-up between binary swaps:
master binary): mean 146,984 round-trips / 30 sPer-iteration deltas were within run-to-run noise (~5–10 % variance). The cumulative effect is what's visible.
Two c-4 Ubuntu 24.04 droplets in nyc1, same VPC default-nyc1.
Current active pair:
coturn-turnserver — public 157.230.3.102, private 10.116.0.2coturn-loadgen — public 167.99.153.216, private 10.116.0.3Older pair used for the iter 5 cumulative run:
coturn-turnserver — public 68.183.121.197, private 10.116.0.2coturn-loadgen — public 68.183.132.220, private 10.116.0.3Created via the DigitalOcean v2 API (doctl is not installed; use
curl + $DIGITALOCEAN_TOKEN from the user's ~/.zshrc). SSH via
~/.ssh/id_rsa (matches DO ssh key id 23704483, fingerprint
37:3a:9b:e3:1e:1a:9b:42:a0:6f:58:f5:5a:3a:6a:2c).
State on the turnserver droplet (kept across iterations):
/root/coturn_clean.tar — git archive HEAD of master at start of run.
Re-extract this before applying any new patch./root/coturn_baseline/build/bin/turnserver — clean baseline binary,
used as the "B" in every A/B round. Don't overwrite./root/coturn/build/bin/turnserver — current iteration binary./root/start_turnserver.sh, /root/baseline_run.sh — helper scripts.State on the loadgen droplet:
/root/coturn/build/bin/turnutils_uclient, turnutils_peer.turnutils_peer runs as a daemon on 10.116.0.3:3480
(pid in /root/peer.pid).A small env file was written to /tmp/coturn_perf_env.sh on the local
machine with the IPs / droplet IDs — recreate it from the current
state of the DO account if it gets lost.
The standard packet-flood command (matches CLAUDE.md baseline, runs without
--udp-recvmmsg; add --udp-recvmmsg to turnserver, not the client, for the
batched listener/relay receive path):
timeout -s INT 30s /root/coturn/build/bin/turnutils_uclient \
-Y packet -m 1 -l 120 \
-e 10.116.0.3 -r 3480 -X -g \
-u user -W secret \
10.116.0.2
Metric: the tot_recv_msgs field on the last start_mclient: log
line. (This is round-trips through the relay over the test window;
send_pps is loadgen-side only and can hit 262 K even when the relay
is dropping most of them, so it's not a useful proxy for relay
throughput.)
perf record -F 99 -g on the turnserver during a 12 s -Y packet -m 1
run, sorted by user-space self-time:
0.80 % send_data_from_ioa_socket_nbh
0.76 % socket_input_worker
0.69 % read_client_connection.isra.0
0.60 % turn_report_session_usage
0.53 % peer_input_handler
0.51 % udp_server_input_handler
0.35 % udp_recvfrom # was 0.76 % at iter 1
0.34 % lm_map_get
0.27 % stun_is_channel_message_str
0.27 % get_relay_socket
0.26 % ioa_socket_check_bandwidth # was 0.33 % at iter 1
0.26 % udp_send # was 0.60 % at iter 1
0.18 % ioa_network_buffer_get_size
Total user-space coturn cycles: ~5–7 % of the relay thread. The relay thread sits at ~100 % CPU pinned to one core; the 4 relay threads aren't parallelised by the m=1 single-flow test (one 5-tuple hashes to one SO_REUSEPORT worker).
Kernel side (children-aggregated) is the real cost:
36 % udp_sendmsg (sendto path)
14 % udp_recvmsg
17 % ip_finish_output / ip_output / __dev_queue_xmit
~23 % syscall enter / exit machinery (sysret, SYSRETQ, SYSCALL_64*)
That ~23 % syscall overhead is the next big lever. Halving it (via batching) is worth ~10 % wall-clock CPU.
--udp-recvmmsg=true on Linux (tried in iter 1, kept opt-in)The flag now covers both the unconnected listener socket in dtls_listener.c and connected plain-UDP relay sockets in ns_ioalib_engine_impl.c. DTLS session sockets remain on the SSL read path and are not batched by the relay socket helper.
Throughput parity or slight negative results were confirmed across multiple
A/B rounds on m=1 and m=100; keep this opt-in until batch occupancy
instrumentation proves that real deployments commonly receive multiple queued
datagrams per connected socket readiness event.
get_relay_socket_ss (iter 3) — no measurable wall-clock winThe function is static inline already and the underlying
get_relay_socket() is a four-line accessor. Caching the result
does save a cross-TU function call per packet (the compiler can't
prove get_relay_socket pure across the
set_df_on_ioa_socket / ioa_network_buffer_* calls in between),
which the perf profile picked up as a small redistribution, but
throughput stayed in the noise band. Kept anyway: the cleanup is
defensible and matches the iter 4/5 inlining direction.
tot_recv_msgs field, not send_pps. Loadgen send rate
saturates at ~262 K pps regardless of relay capacity — it's
whatever the loadgen kernel will accept into its UDP send buffer.
The receive count is what made it round-trip through the relay.SO_REUSEPORT the kernel
hashes 5-tuples to worker sockets; one client → one tuple → one
worker thread. The other 3 cores sit idle. To exercise all 4 relay
threads you'd need m≥4 with distinct source ports — ours don't
spread cleanly because the loadgen reuses ports./root/coturn between iterations if you want
to keep git apply-style patches working. The droplet copy is not
a git checkout (it's the git archive tar). Use patch -p1. Each
iteration uploaded a cumulative diff (current branch vs master)
and re-extracted from /root/coturn_clean.tar first to get a clean
apply.Ordered by expected impact for the m=1 packet-flood metric:
Batch the send side (sendmmsg) or pass receive batches deeper. The
occupancy counters show receive batching is already working: m=1 averaged
15.6 packets per call and m=100 averaged 11.4. The code immediately
invokes the existing per-packet callback for each received datagram, and
each forwarded packet still pays a separate send syscall. The next
measurable lever is to queue per-thread outbound datagrams during a receive
batch and flush them with sendmmsg, or introduce a batch-aware callback
path for the hot UDP relay case.
Keep recvmmsg occupancy counters available while developing send
batching. They are cheap enough for targeted performance builds and make
it obvious whether a benchmark is exercising one relay thread or all relay
threads. Consider hiding periodic logs behind a verbose/debug option before
shipping broadly.
GSO (UDP_SEGMENT) on the send path. Linux can take one
"large" datagram and segment it in the kernel for back-to-back
packets to the same destination. Our channel-data flood IS
same-destination. Setting UDP_SEGMENT and submitting a single
sendmsg of N×packet_size cuts skb-alloc / __dev_queue_xmit
work substantially. Needs careful handling for short tails and
non-uniform sizes; complementary to (2).
Inline more cross-TU per-packet accessors. Pattern from iter
4/5 still applies: addr_eq (called per channel-data packet for
permission lookup), ioa_network_buffer_get_size,
get_ioa_socket_type / _app_type. Each is small enough; the
only reason to be cautious is they're declared in ns_turn_ioalib.h
which is part of the public-ish server library API — moving the
body inline doesn't break ABI but does require a recompile of all
consumers. Likely <1 % each but cheap to do.
Re-evaluate --udp-recvmmsg default after instrumentation. The current
measurements do not justify default-on. Revisit only if production-like
traces show frequent batch sizes above one and no latency/memory downside.
set_socket_ttl / set_socket_tos already short-circuit on
no-change via s->current_ttl != ttl / s->current_tos != tos.
In a steady-state flood the per-packet call returns immediately
without setsockopt. Already optimized.set_df_on_ioa_socket similarly guarded
(ns_ioalib_engine_impl.c:242).turn_report_session_usage slow path runs once per 4096 packets
(see iter 1 commit); the per-call overhead is now ~3 reads + 1
bitmask test + 1 conditional return.MSG_CONFIRM in sendto would skip ARP refresh, but
neigh_resolve_output + neigh_hh_output show ~17 % combined in
perf only because we're sending that many packets — per-packet
it's the normal cached neighbor path, not a refresh.MAX_TRIES from 16 to 64 in socket_input_worker
doesn't change syscall count; it only delays returning to libevent.
Useless without (1) above.c-4 / nyc1 / default-nyc1 VPC and
the pavel SSH key (id 23704483)./tmp/coturn_clean.tar from git archive master and
rebuild /root/coturn_baseline/build/bin/turnserver if the
baseline binary is gone. The A/B harness depends on having both
binaries side-by-side on the turnserver droplet.master by ~5 %. If it doesn't, the
environment drifted and the baseline needs re-anchoring.recvmmsg into
socket_input_worker — is where the next material gain lives.A later run on two DigitalOcean CPU-optimized c-4 droplets in sfo3
(10.124.0.2 turnserver, 10.124.0.3 loadgen) tested an experimental
Linux-only --udp-sendmmsg flag with --udp-recvmmsg.
| Run | Code/flags | Generator max pps | Generator avg pps | Server RX avg pps | Server TX avg pps | Server TX peak pps | CPU avg | Perf conclusion |
|---|---|---|---|---|---|---|---|---|
| iter0 | baseline, --udp-recvmmsg | 335,872 | 286,721 | 360,900 | 257,357 | 323,488 | 97.8% | sendto/udp_sendmsg dominates |
| iter1 | --udp-sendmmsg, both directions | 409,600 | 312,662 | 428,184 | 197,300 | 260,453 | 99.8% | sendmmsg path dominates; TX regressed |
| iter2 | sendmmsg only for batches >= 4 | 393,216 | 315,393 | 398,121 | 163,626 | 215,068 | 98.9% | Threshold did not recover TX |
| iter3 | listener-side batching only | 425,984 | 286,038 | 376,444 | 210,050 | 332,417 | 97.4% | Peak ingress/TX improved, average TX still below baseline |
Validation result: sendmmsg() is not a proven general win for this workload.
It can increase generator max pps and peak server TX, but average delivered
server TX stayed below the --udp-recvmmsg baseline. Keep it opt-in until a
follow-up change proves better end-to-end relay throughput.
Perf still points at per-datagram kernel transmit cost:
udp_send -> sendto -> __sys_sendto -> udp_sendmsg -> udp_send_skb -> ip_outputudp_sendmmsg_flush -> __sendmmsg -> __sys_sendmmsg -> ___sys_sendmsg -> udp_sendmsg -> ip_outputThe key observation is that sendmmsg() reduces syscall entry count but still
walks udp_sendmsg and the IP output path once per datagram. On this workload,
the extra mmsghdr copy/looping overhead can offset the syscall savings.
Deferred bigger refactors from this run:
io_uring send batching or kernel-bypass style transmit only as
a larger architecture experiment.--udp-gso)Realizes the GSO backlog item from the iter-5 backlog above. The recvmmsg /
sendmmsg follow-ups confirmed that on this workload the dominant cost is the
per-datagram kernel TX path (udp_sendmsg → ip_finish_output → __dev_queue_xmit → start_xmit), which mmsg-style batching does not collapse. UDP-GSO (Linux
UDP_SEGMENT cmsg) does collapse it: N same-destination, same-size datagrams
are submitted as one sendmsg carrying an iovec; the kernel allocates one
super-skb that traverses the network stack once and is split at egress (NIC).
Implementation lives in src/apps/relay/ns_ioalib_engine_impl.c
and reuses the existing --udp-sendmmsg batch state. Eligibility (same fd,
same dest, same size, ≤ 1472 B per datagram) is tracked on every
udp_sendmmsg_enqueue; eligible flushes go through udp_gso_attempt_flush
ahead of the sendmmsg loop, with an automatic sticky disable on
EINVAL/ENOPROTOOPT so a kernel/NIC without GSO support gracefully falls back.
The relay-side socket_udp_read_batch_recvmmsg now wraps its callback loop
in udp_sendmmsg_batch_begin/end so peer→client sends triggered inside a
recvmmsg batch can also coalesce — without that wrapping, the relay path
issues one sendto per delivered datagram.
DigitalOcean validation on 2026-05-09 — fresh nyc1 c-4 droplets (turn
10.116.0.4, load 10.116.0.5), all variants built from the same source tree
under /root/coturn/build, -Y packet -m 1 -l 120, monitor window via sar -n DEV for eth1, mpstat, pidstat. The 12 s sweep first established the
ordering, then a 30 s alternating A/B (baseline → gso → baseline → gso)
confirmed the magnitude of the delta:
| Variant | eth1 RX pps | eth1 TX pps | sys CPU | idle CPU |
|---|---|---|---|---|
| baseline_r1 | 322,091 | 127,445 | 22.9% | 67.5% |
--udp-recvmmsg --udp-sendmmsg --udp-gso (gso_r1) | 266,068 | 257,996 | 15.0% | 78.7% |
| baseline_r2 | 309,475 | 125,573 | 20.9% | 70.7% |
| gso_r2 | 275,992 | 225,366 | 14.9% | 74.3% |
Mean server forwarding rate (eth1 TX): baseline 126,509 pps → GSO 241,681 pps, +91 % (1.91×), with mean system CPU dropping from 21.9 % to 14.9 % — about 2.8× CPU efficiency in TX pps per system-CPU-%.
12 s packet sweep, all four variants, mean send_pps reported by uclient (used only for ordering — for absolute throughput trust eth1 TX above):
| Variant | m=1 | m=2 | m=4 | m=8 | m=16 | m=32 |
|---|---|---|---|---|---|---|
| baseline | 230,401 | 150,189 | 187,055 | 174,771 | 160,871 | 167,789 |
--udp-recvmmsg | 255,660 | 148,824 | 174,767 | 142,997 | 150,743 | 144,200 |
--udp-recvmmsg --udp-sendmmsg | 231,766 | 146,776 | 148,826 | 136,542 | 148,955 | 143,575 |
--udp-recvmmsg --udp-sendmmsg --udp-gso | 136,876 | 147,458 | 124,250 | 131,081 | 137,636 | 114,714 |
The uclient generator reports its own send rate, which drops with GSO because
the loadgen droplet's turnutils_peer becomes the new bottleneck — it is
single-threaded and cannot reflect 240 k pps. The 30 s eth1 capture is the
authoritative server-side metric; sweep_m1 is retained only to show that
GSO does not regress in the moderately-loaded m=2..32 range relative to
recvmmsg+sendmmsg.
Perf children share, m=1 12 s perf record on the turnserver process:
| Symbol | baseline | recvmmsg | recvsendmmsg | gso |
|---|---|---|---|---|
__x64_sys_sendto (children) | 43.6 % | 47.6 % | 22.8 % | 0.0 % |
__x64_sys_sendmsg (children) | — | — | — | 38.1 % |
__x64_sys_sendmmsg (children) | — | — | 27.0 % | 0.0 % |
udp_sendmsg | 38.8 % | 41.9 % | 20.6 % | 35.9 % |
__dev_queue_xmit | 18.5 % | — | — | 29.3 % |
skb_segment (egress GSO split) | absent | absent | absent | 2.2 % |
syscall_return_via_sysret (self) | 7.2 % | 4.7 % | 4.4 % | 2.4 % |
entry_SYSCALL_64_after_hwframe (self) | 4.1 % | 3.6 % | 2.6 % | 1.8 % |
In the GSO column the per-packet kernel-stack cost is now amortized across
the segments of a single super-skb. The proportional rise of
__dev_queue_xmit is misleading on its own — it reflects a smaller
denominator (CPU usage dropped) while the per-packet absolute cost dropped.
Operational notes:
--udp-gso requires --udp-sendmmsg; without that flag
the batch state never accumulates and GSO has nothing to flush. The
--help text states the dependency._begin/_end. Mixed-destination or
mixed-size workloads transparently fall back through the existing
sendmmsg and udp_send paths.EINVAL/ENOPROTOOPT keeps a process running on an
un-virtio host or older kernel from hot-looping in the sticky failure
path. A WARNING line is logged once.c-4), gso_max_segs=65535. Older
hosts (kernel <4.18) lack UDP_SEGMENT entirely; the sticky-disable
path covers them.Suggested next levers if more relay throughput is needed:
pktgen-style reflector would let us
measure the real ceiling.route_lookup per send.MSG_ZEROCOPY on the GSO sendmsg. rep_movs_alternative is still
3 % self in GSO, and zerocopy avoids the userspace→kernel copy.
Probably small for 32-B STUN packets; revisit when payloads are larger.Artifacts (perf.data, sar/mpstat/pidstat, sweep logs, AB logs) are saved at
perf-results-20260508-213056/ in the worktree.