INTERNODE_DATA_TRANSPORT_RFC.md
Status: draft

Last updated: 2026-05-19

Scope: internode data-path analysis, benchmark baseline, and transport boundary
RustFS does not currently include RDMA, RoCE, InfiniBand, DPU, BlueField/DOCA, DPDK, SPDK, or SmartNIC offload support. The current distributed internode paths use TCP-based HTTP/gRPC transports:

- tonic gRPC NodeService for most control, metadata, lock, health, and peer operations.
- /rustfs/rpc/ for remote disk file streams.

RDMA/RoCE is still a plausible future optimization for large internode disk data transfers, but it should not replace the whole internode RPC surface. The correct first step is to isolate the data plane, establish a TCP baseline, and introduce a pluggable transport boundary only around high-volume streams.
- tonic gRPC for control-plane RPCs.

The main HTTP server builds a hybrid service per connection:

- rustfs/src/server/http.rs wires a NodeServiceServer for gRPC.
- rustfs/src/storage/rpc/InternodeRpcService intercepts HTTP paths under /rustfs/rpc/.

Compression logic already treats /rustfs/rpc/ and /rustfs/peer/ as internode RPC paths and skips normal response compression for them.
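The prefix check described above can be sketched as follows. This is a minimal illustration, not the RustFS compression code: the helper name is_internode_rpc_path and the constant are hypothetical, and only the two prefixes come from the text.

```rust
/// Path prefixes the text names as internode RPC routes.
const INTERNODE_PREFIXES: [&str; 2] = ["/rustfs/rpc/", "/rustfs/peer/"];

/// Returns true when `path` is an internode RPC route, meaning normal
/// response compression should be skipped for it.
/// Hypothetical helper: the real check lives in the compression layer.
pub fn is_internode_rpc_path(path: &str) -> bool {
    INTERNODE_PREFIXES
        .iter()
        .any(|prefix| path.starts_with(prefix))
}
```

Matching on the trailing slash keeps unrelated paths such as /rustfs/rpcx out of the internode set.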
crates/protos/src/lib.rs creates internode gRPC channels with tonic
Endpoint:
This confirms the current gRPC transport is TCP/HTTP2-based.
crates/protos/src/node.proto defines one NodeService that mixes several
classes of RPCs:
The service layout is practical today, but it is too broad to become an RDMA surface. A future high-throughput transport should target only disk data streams and keep this gRPC service as the control plane.
These paths carry coordination, metadata, health, and administrative state. They should remain on gRPC/TCP:
| Area | Client/server code | Examples | Notes |
|---|---|---|---|
| Bucket peer ops | crates/ecstore/src/rpc/peer_s3_client.rs, rustfs/src/storage/rpc/bucket.rs | MakeBucket, ListBucket, DeleteBucket, GetBucketInfo, HealBucket | Small metadata/control payloads. |
| Locking | crates/ecstore/src/rpc/remote_locker.rs, rustfs/src/storage/rpc/lock.rs | Lock, UnLock, Refresh, batch lock/unlock | Latency-sensitive but not bulk data; correctness and timeout semantics matter more than transport bandwidth. |
| Peer/admin state | crates/ecstore/src/rpc/peer_rest_client.rs, rustfs/src/storage/rpc/health.rs, metrics.rs, event.rs | LocalStorageInfo, ServerInfo, GetMetrics, GetLiveEvents, reload APIs, rebalance APIs | Operational control plane. |
| Disk metadata/control | crates/ecstore/src/rpc/remote_disk.rs, rustfs/src/storage/rpc/disk.rs | DiskInfo, ReadXL, ReadVersion, ReadMetadata, WriteMetadata, RenameFile, RenamePart, Delete*, VerifyFile, CheckParts | Usually metadata, integrity checks, or namespace mutations. |
| Connection health | RemoteDisk, RemotePeerS3Client, PeerRestClient | TCP connectivity probes and fault/recovery state | Must remain available even if an optional data backend is unavailable. |
These paths move object shard bytes or stream potentially large disk data and are the only reasonable first candidates for a pluggable transport.
| Priority | Path | Current client | Current server | Current transport | Why it matters |
|---|---|---|---|---|---|
| P0 | read_file_stream | RemoteDisk::read_file_stream | handle_read_file in http_service.rs | HTTP GET /rustfs/rpc/read_file_stream with a streaming response body | Main remote disk read stream used by bitrot readers and erasure reads. |
| P0 | put_file_stream | RemoteDisk::create_file and RemoteDisk::append_file | handle_put_file in http_service.rs | HTTP PUT /rustfs/rpc/put_file_stream with a streaming request body | Main remote disk write stream used by bitrot writers and erasure writes. |
| P1 | walk_dir | RemoteDisk::walk_dir | handle_walk_dir in http_service.rs | HTTP GET /rustfs/rpc/walk_dir with a streamed metadata listing | Can be high-volume during scans/healing, but it is metadata-oriented rather than object byte data. |
| P1 | ReadAll / WriteAll | RemoteDisk::read_all / write_all | gRPC unary disk handlers | gRPC unary bytes payload | Moves bytes today, but should be measured before treating it as a high-throughput data path. |
| P2 | proto WriteStream / ReadAt | currently not used | currently returns unimplemented | gRPC streaming definitions exist but are not implemented | Possible future API shape, not a current production path. |
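The split between the two tables can be expressed as a small classification function. The sketch below is an assumption-laden illustration: the Plane enum and both function names are hypothetical, and operation names are matched as plain strings taken from the tables, with anything unlisted treated as control plane.

```rust
/// Which plane an internode operation belongs to in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plane {
    /// Stays on gRPC/TCP.
    Control,
    /// Data-plane candidate with the table priority (0 = P0, 1 = P1, 2 = P2).
    DataCandidate(u8),
}

/// Maps an operation name from the tables to its plane.
pub fn classify(op: &str) -> Plane {
    match op {
        "read_file_stream" | "put_file_stream" => Plane::DataCandidate(0),
        "walk_dir" | "ReadAll" | "WriteAll" => Plane::DataCandidate(1),
        "WriteStream" | "ReadAt" => Plane::DataCandidate(2),
        // Locks, bucket ops, metadata, health, and admin calls.
        _ => Plane::Control,
    }
}

/// Keeps only the P0 streams, the first candidates for a pluggable transport.
pub fn first_prototype_targets<'a>(ops: &[&'a str]) -> Vec<&'a str> {
    ops.iter()
        .copied()
        .filter(|op| classify(op) == Plane::DataCandidate(0))
        .collect()
}
```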
For object PUTs in distributed erasure mode, the relevant flow is:

1. SetDisks selects local and remote disks.
2. create_bitrot_writer calls disk.create_file(...) for each shard writer.
3. RemoteDisk::create_file returns an HttpWriter.
4. HttpWriter sends an HTTP PUT to /rustfs/rpc/put_file_stream.
5. handle_put_file opens the local file writer and copies incoming body chunks into it.
6. Erasure::encode writes shards through MultiWriter to all selected writers while enforcing write quorum.

This is the primary write data-plane candidate.
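The quorum rule in the last step of the PUT flow can be illustrated with a simplified, synchronous stand-in. This is not the RustFS MultiWriter: the function write_shards_with_quorum and the FailingWriter test double are hypothetical, and std::io::Write replaces the async writers used in the real code.

```rust
use std::io::{self, Write};

/// Writes one shard per writer and succeeds only if at least `write_quorum`
/// writers accept their shard. Returns the number of successful writers.
pub fn write_shards_with_quorum(
    writers: &mut [Box<dyn Write>],
    shards: &[Vec<u8>],
    write_quorum: usize,
) -> io::Result<usize> {
    assert_eq!(writers.len(), shards.len(), "one shard per writer");
    let mut ok = 0;
    for (writer, shard) in writers.iter_mut().zip(shards) {
        // A failed writer (for example an unreachable remote disk) is
        // tolerated as long as quorum can still be met.
        if writer.write_all(shard).and_then(|_| writer.flush()).is_ok() {
            ok += 1;
        }
    }
    if ok >= write_quorum {
        Ok(ok)
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("write quorum not met: {ok}/{write_quorum}"),
        ))
    }
}

/// Test double that always fails, standing in for an unreachable remote disk.
pub struct FailingWriter;

impl Write for FailingWriter {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "remote disk unreachable"))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

With three writers where one fails, a quorum of 2 succeeds and a quorum of 3 returns an error, which is the property any replacement transport has to preserve.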
For object GETs and repair reads in distributed erasure mode, the relevant flow is:

1. SetDisks prepares shard readers for the selected disks.
2. create_bitrot_reader uses local zero-copy only when disk.is_local().
3. Remote disks go through disk.read_file_stream(...).
4. RemoteDisk::read_file_stream returns an HttpReader.
5. HttpReader sends an HTTP GET to /rustfs/rpc/read_file_stream.
6. handle_read_file opens the local disk stream and returns it as an HTTP streaming body.

This is the primary read data-plane candidate.
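The read-side dispatch (zero-copy for local disks, a stream for everything else) can be modelled in a few lines. Everything here is a hypothetical stand-in: Disk holds its bytes in memory, ReadPath only labels which branch was taken, and open_shard_reader is not a RustFS function.

```rust
use std::io::{self, Cursor, Read};

/// In-memory stand-in for a disk handle.
pub enum Disk {
    Local(Vec<u8>),
    Remote(Vec<u8>),
}

/// Which read path the sketch selected.
#[derive(Debug, PartialEq, Eq)]
pub enum ReadPath {
    LocalZeroCopy,
    RemoteStream,
}

impl Disk {
    pub fn is_local(&self) -> bool {
        matches!(self, Disk::Local(_))
    }

    fn bytes(&self) -> &[u8] {
        match self {
            Disk::Local(b) | Disk::Remote(b) => b.as_slice(),
        }
    }
}

/// Opens a reader over `length` bytes at `offset`, choosing the zero-copy
/// path only for local disks and the stream path otherwise.
pub fn open_shard_reader(
    disk: &Disk,
    offset: usize,
    length: usize,
) -> io::Result<(ReadPath, Box<dyn Read + '_>)> {
    let path = if disk.is_local() {
        ReadPath::LocalZeroCopy
    } else {
        ReadPath::RemoteStream
    };
    let bytes = disk.bytes();
    let end = offset
        .checked_add(length)
        .filter(|end| *end <= bytes.len())
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of shard"))?;
    let reader: Box<dyn Read + '_> = Box::new(Cursor::new(&bytes[offset..end]));
    Ok((path, reader))
}
```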
RustFS already has coarse internode metrics in crates/common/src/internode_metrics.rs:
These metrics are useful as a starting point, but they are not enough for a transport RFC. A transport benchmark needs route-level and operation-level measurements for at least:

- read_file_stream
- put_file_stream
- walk_dir
- ReadAll / WriteAll

Existing benchmark assets:

- scripts/run_object_batch_bench.sh
- scripts/run_object_batch_bench_enhanced.sh
- scripts/run_object_batch_bench_abc.sh
- scripts/run_four_node_cluster_failover_bench.sh
- scripts/run_internode_transport_baseline.sh (scenario matrix wrapper for local vs distributed TCP baseline artifacts)
- crates/ecstore/benches/

These mostly cover S3/object workload or erasure coding performance. They do not yet isolate internode transport cost.
Before adding a transport abstraction or any RDMA backend, collect a baseline for the current TCP/HTTP/gRPC implementation.
Minimum:
Preferred:
| Workload | Sizes | Concurrency | Main signal |
|---|---|---|---|
| S3 PUT | 4 KiB, 1 MiB, 16 MiB, 128 MiB, 1 GiB | 1, 16, 64, 128 | End-to-end write throughput and tail latency. |
| S3 GET | 4 KiB, 1 MiB, 16 MiB, 128 MiB, 1 GiB | 1, 16, 64, 128 | End-to-end read throughput and tail latency. |
| Remote disk stream read | shard-sized ranges from read_file_stream | 1, 16, 64 | Isolated internode read path. |
| Remote disk stream write | shard-sized writes through put_file_stream | 1, 16, 64 | Isolated internode write path. |
| Healing / repair | missing disk or missing shard scenario | controlled | Rebuild throughput and read/write amplification. |
| Scanner walk | large bucket/object namespace | controlled | Metadata streaming pressure, not primary RDMA target. |
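The S3 rows of the matrix expand into a fixed set of scenarios. The sketch below enumerates them so a harness can iterate deterministically; the Scenario struct, the constants, and s3_matrix are hypothetical names, while the sizes and concurrency levels are taken from the table.

```rust
/// One benchmark scenario from the S3 rows of the matrix.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Scenario {
    pub workload: &'static str,
    pub object_size: u64,
    pub concurrency: u32,
}

const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Object sizes from the table: 4 KiB, 1 MiB, 16 MiB, 128 MiB, 1 GiB.
pub const S3_SIZES: [u64; 5] = [4 * KIB, MIB, 16 * MIB, 128 * MIB, GIB];
/// Concurrency levels from the table.
pub const S3_CONCURRENCY: [u32; 4] = [1, 16, 64, 128];

/// Expands the S3 PUT and S3 GET rows into every size/concurrency pair.
pub fn s3_matrix() -> Vec<Scenario> {
    let mut out = Vec::new();
    for workload in ["S3 PUT", "S3 GET"] {
        for &object_size in &S3_SIZES {
            for &concurrency in &S3_CONCURRENCY {
                out.push(Scenario { workload, object_size, concurrency });
            }
        }
    }
    out
}
```

Two workloads, five sizes, and four concurrency levels give 40 scenarios, which is worth keeping in mind when budgeting run time for the 1 GiB cases.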
Collect:

- rustfs_system_network_internode_* metrics

The baseline should produce a machine-readable artifact, for example target/bench/internode-transport/<timestamp>/summary.csv, plus the exact commands and configuration used.
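One way to derive a summary row from raw latency samples is sketched below. The column layout (workload, object size in bytes, concurrency, throughput in MiB/s, p50 and p99 in milliseconds) is an assumption for illustration, not the format the existing scripts emit, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples.
/// Returns None for an empty sample set or a percentile outside 1..=100.
pub fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank: ceil(pct / 100 * n), computed in integers.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Formats one CSV row:
/// workload,object_size_bytes,concurrency,throughput_mib_s,p50_ms,p99_ms
pub fn summary_row(
    workload: &str,
    object_size: u64,
    concurrency: u32,
    wall_time: Duration,
    samples: &[Duration],
) -> Option<String> {
    let p50 = percentile(samples, 50)?;
    let p99 = percentile(samples, 99)?;
    // Every sample is assumed to have moved exactly `object_size` bytes.
    let total_mib = (object_size as f64 * samples.len() as f64) / (1024.0 * 1024.0);
    let throughput = total_mib / wall_time.as_secs_f64();
    Some(format!(
        "{workload},{object_size},{concurrency},{throughput:.2},{:.3},{:.3}",
        p50.as_secs_f64() * 1e3,
        p99.as_secs_f64() * 1e3
    ))
}
```

Throughput is computed from wall-clock time rather than summed latencies so that concurrent requests are not double counted.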
Use scripts/run_internode_transport_baseline.sh to execute a reproducible S3 PUT/GET matrix against local and distributed scenarios and export:

- summary.csv (throughput/latency summary per workload and object size)
- internode_metric_deltas.csv (operation-level internode metric deltas when --metrics-url is provided)

Keep NodeService as the control plane. Introduce a separate data transport only below RemoteDisk, where remote disk byte streams are opened today.
The first implementation should be a no-behavior-change TCP/HTTP backend that
wraps the current HttpReader, HttpWriter, and /rustfs/rpc/* handlers.
Only after that wrapper is benchmarked should an experimental RDMA/RoCE backend
be considered.
The narrowest useful boundary is remote disk stream transfer:
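```rust
#[async_trait::async_trait]
pub trait InternodeDataTransport: Send + Sync + std::fmt::Debug {
    /// Opens a remote disk read stream.
    async fn open_read(&self, request: ReadStreamRequest) -> Result<FileReader>;
    /// Opens a remote disk write stream.
    async fn open_write(&self, request: WriteStreamRequest) -> Result<FileWriter>;
    /// Streams a directory walk into `writer`; `Send + Unpin` lets implementations write across `.await` points.
    async fn walk_dir(
        &self,
        request: WalkDirStreamRequest,
        writer: &mut (dyn AsyncWrite + Send + Unpin),
    ) -> Result<()>;
    fn capabilities(&self) -> InternodeTransportCapabilities;
}
```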
Initial request fields should mirror the current HTTP query parameters:
The initial TCP backend can keep the current signed HTTP URLs internally.
RemoteDisk should delegate only these methods to the data transport:

- read_file_stream
- read_file_zero_copy as a wrapper over read_file_stream unless the backend supports a stronger zero-copy API
- append_file
- create_file
- walk_dir

All other RemoteDisk methods should continue using the current gRPC client until measurements prove otherwise.
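The delegation rule can be sketched with a synchronous, std-only analog. This is deliberately not the async trait from the RFC: DataTransport, FakeTcpTransport, and the stubbed disk_info are hypothetical, and the fake backend serves fixed bytes instead of issuing an HTTP request.

```rust
use std::io::{self, Cursor, Read};
use std::sync::Arc;

/// Synchronous, simplified analog of the data-transport boundary.
pub trait DataTransport: Send + Sync {
    fn open_read(&self, volume: &str, path: &str) -> io::Result<Box<dyn Read>>;
}

/// Stand-in for the default TCP/HTTP backend.
pub struct FakeTcpTransport(pub Vec<u8>);

impl DataTransport for FakeTcpTransport {
    fn open_read(&self, _volume: &str, _path: &str) -> io::Result<Box<dyn Read>> {
        Ok(Box::new(Cursor::new(self.0.clone())))
    }
}

/// Remote disk that owns a pluggable transport for byte streams only.
pub struct RemoteDisk {
    transport: Arc<dyn DataTransport>,
}

impl RemoteDisk {
    pub fn new(transport: Arc<dyn DataTransport>) -> Self {
        Self { transport }
    }

    /// Byte stream: delegated to the pluggable transport.
    pub fn read_file_stream(&self, volume: &str, path: &str) -> io::Result<Box<dyn Read>> {
        self.transport.open_read(volume, path)
    }

    /// Metadata call: stays on the existing control-plane client.
    /// Stubbed here to return the name of the path it would use.
    pub fn disk_info(&self) -> &'static str {
        "grpc"
    }
}
```

Holding the transport behind an Arc<dyn ...> keeps the default backend swappable per disk without touching the metadata methods.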
Avoid hard-coding RDMA assumptions into the generic interface. Use capabilities:
The first TCP backend should report only capabilities that it actually provides.
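A capability struct along these lines would let callers branch on what a backend supports rather than on which backend it is. The field names below are purely illustrative assumptions and are not taken from the RFC; only the type name InternodeTransportCapabilities appears in the trait sketch.

```rust
/// Hypothetical capability flags; field names are illustrative only.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct InternodeTransportCapabilities {
    /// Backend can hand out buffers without an intermediate copy.
    pub zero_copy_read: bool,
    /// Backend requires or benefits from pre-registered memory regions.
    pub registered_memory: bool,
    /// Upper bound on concurrent streams, if the backend has one.
    pub max_inflight_streams: Option<u32>,
}

impl InternodeTransportCapabilities {
    /// A TCP/HTTP backend claims nothing beyond plain streaming.
    pub fn tcp() -> Self {
        Self::default()
    }
}

/// Callers pick a strategy from capabilities, not from the backend type.
pub fn choose_read_strategy(caps: InternodeTransportCapabilities) -> &'static str {
    if caps.zero_copy_read {
        "zero_copy"
    } else {
        "stream_copy"
    }
}
```

Defaulting every flag to false means a new backend has to opt in to each capability explicitly.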
TCP/HTTP/gRPC must remain the default and required backend.
Fallback rules:
Suggested future configuration shape:
```bash
RUSTFS_INTERNODE_DATA_TRANSPORT=tcp
RUSTFS_INTERNODE_DATA_TRANSPORT_FALLBACK=tcp
```
Do not add these settings until there is an implementation PR that uses them.
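For illustration only, resolving those two values could look like the sketch below. The resolution rules are assumptions rather than the RFC's fallback rules: an unset value means tcp, an unavailable requested backend falls back to the configured fallback, and an unknown value is an error. The rdma value, the DataTransportKind enum, and both functions are hypothetical.

```rust
use std::str::FromStr;

/// Hypothetical set of data-transport backends.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataTransportKind {
    Tcp,
    Rdma,
}

impl FromStr for DataTransportKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "tcp" => Ok(Self::Tcp),
            "rdma" => Ok(Self::Rdma), // hypothetical value, not defined by the RFC
            other => Err(format!("unknown internode data transport: {other}")),
        }
    }
}

/// Parses an optional setting, treating "unset" as the required TCP default.
fn parse_or_tcp(value: Option<&str>) -> Result<DataTransportKind, String> {
    match value {
        Some(s) => s.parse(),
        None => Ok(DataTransportKind::Tcp),
    }
}

/// Resolves the effective backend from the requested and fallback settings.
pub fn resolve_transport(
    requested: Option<&str>,
    fallback: Option<&str>,
    rdma_available: bool,
) -> Result<DataTransportKind, String> {
    let requested = parse_or_tcp(requested)?;
    let fallback = parse_or_tcp(fallback)?;
    // TCP is always usable; RDMA only when the caller reports it available.
    let usable = |kind: DataTransportKind| kind == DataTransportKind::Tcp || rdma_available;
    if usable(requested) {
        Ok(requested)
    } else if usable(fallback) {
        Ok(fallback)
    } else {
        Err("no usable internode data transport".to_string())
    }
}
```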
A future RDMA backend should be experimental and feature-gated. It should be designed as an optional data-plane backend, not as a replacement for the gRPC control plane.
Required design areas:
The first RDMA prototype should target read_file_stream and put_file_stream
only. walk_dir, metadata RPCs, locks, admin RPCs, and bucket coordination
should remain on gRPC unless a later benchmark identifies a specific bottleneck.
These technologies should not drive the first abstraction:
Next steps:

1. Add route-level and operation-level measurements for /rustfs/rpc/read_file_stream, /rustfs/rpc/put_file_stream, /rustfs/rpc/walk_dir, and gRPC disk byte calls.
2. Add an InternodeDataTransport wrapper with a TCP/HTTP backend that preserves current behavior.
3. Route RemoteDisk stream methods to the transport wrapper without changing default behavior.

Open questions:

- Should ReadAll and WriteAll stay as gRPC unary calls, or should large payloads be redirected to the data transport?
- Is walk_dir a metadata control stream or a secondary data-plane stream for scanner/healing workloads?