crates/ecstore/docs/internode-transport/internode-transport-adapter-rfc.md
Status: draft

Last updated: 2026-05-22

Scope: OSS internode data-plane adapter analysis, benchmark baseline, and transport boundary

The current distributed internode paths use TCP-based HTTP/gRPC transports:

- tonic gRPC `NodeService` for most control, metadata, lock, health, and peer operations.
- HTTP paths under `/rustfs/rpc/` for remote disk file streams.

This document frames the existing work as an OSS `InternodeDataTransport` adapter boundary. The adapter keeps RustFS data-plane logic separate from the concrete transport backend while preserving the current TCP/HTTP behavior as the default implementation.

Current implementation status:

- `InternodeDataTransport` exists in `crates/ecstore/src/rpc/internode_data_transport.rs`.
- The default backend is `tcp-http`; `tcp` is accepted as an alias.
- `RUSTFS_INTERNODE_DATA_TRANSPORT` selects the backend. Blank or unset values use `tcp-http`; invalid values fail closed.
- `RemoteDisk::read_file_stream`, `RemoteDisk::create_file`, `RemoteDisk::append_file`, and `RemoteDisk::walk_dir` delegate to the transport.
- `NodeService` gRPC remains the internode control plane and continues to carry metadata/control operations.

Related design notes in this directory:

- `transport-capabilities.md`
- `transport-buffer-lifecycle.md`
- `transport-buffer-contract.md`
- `transport-fallback-and-selection.md`
- `transport-metrics-and-baseline.md`

The OSS scope is:

- `InternodeDataTransport` adapter boundary;
- `tcp-http` as the default backend;

The OSS scope is not:

`InternodeDataTransport` adapter and the paths that remain on gRPC.

tonic gRPC for control-plane RPCs.

The main HTTP server builds a hybrid service per connection:

- `rustfs/src/server/http.rs` wires a `NodeServiceServer` for gRPC.
- `InternodeRpcService` in `rustfs/src/storage/rpc/` intercepts HTTP paths under `/rustfs/rpc/`.

Compression logic already treats `/rustfs/rpc/` and `/rustfs/peer/` as internode RPC paths and skips normal response compression for them.
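
The prefix check described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the helper name `is_internode_rpc_path` and the sample request paths are assumptions, and only the two prefixes come from this document.

```rust
/// Path prefixes this document names as internode RPC traffic.
const INTERNODE_PREFIXES: [&str; 2] = ["/rustfs/rpc/", "/rustfs/peer/"];

/// Hypothetical helper: returns true when a request path is internode RPC
/// traffic, which is the case where normal response compression is skipped.
fn is_internode_rpc_path(path: &str) -> bool {
    INTERNODE_PREFIXES
        .iter()
        .any(|prefix| path.starts_with(prefix))
}

fn main() {
    // Sample paths are illustrative; only the prefixes are taken from the text.
    for path in ["/rustfs/rpc/read_file_stream", "/rustfs/peer/example", "/bucket/object"] {
        println!("{path}: skip_compression={}", is_internode_rpc_path(path));
    }
}
```

Matching on the leading prefix only means an object key that merely contains `/rustfs/rpc/` later in its path is still treated as normal S3 traffic.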
`crates/protos/src/lib.rs` creates internode gRPC channels with tonic `Endpoint`, which confirms that the current gRPC transport is TCP/HTTP2-based.

`crates/protos/src/node.proto` defines one `NodeService` that mixes several classes of RPCs, including control, metadata, lock, health, and peer operations.

The service layout is practical today, but it is too broad to become the transport adapter surface. A pluggable data transport should target only disk data streams and keep this gRPC service as the control plane.

These paths carry coordination, metadata, health, and administrative state. They should remain on gRPC/TCP:
| Area | Client/server code | Examples | Notes |
|---|---|---|---|
| Bucket peer ops | crates/ecstore/src/rpc/peer_s3_client.rs, rustfs/src/storage/rpc/bucket.rs | MakeBucket, ListBucket, DeleteBucket, GetBucketInfo, HealBucket | Small metadata/control payloads. |
| Locking | crates/ecstore/src/rpc/remote_locker.rs, rustfs/src/storage/rpc/lock.rs | Lock, UnLock, Refresh, batch lock/unlock | Latency-sensitive but not bulk data; correctness and timeout semantics matter more than transport bandwidth. |
| Peer/admin state | crates/ecstore/src/rpc/peer_rest_client.rs, rustfs/src/storage/rpc/health.rs, metrics.rs, event.rs | LocalStorageInfo, ServerInfo, GetMetrics, GetLiveEvents, reload APIs, rebalance APIs | Operational control plane. |
| Disk metadata/control | crates/ecstore/src/rpc/remote_disk.rs, rustfs/src/storage/rpc/disk.rs | DiskInfo, ReadXL, ReadVersion, ReadMetadata, WriteMetadata, RenameFile, RenamePart, Delete*, VerifyFile, CheckParts | Usually metadata, integrity checks, or namespace mutations. |
| Connection health | RemoteDisk, RemotePeerS3Client, PeerRestClient | TCP connectivity probes and fault/recovery state | Must remain available even if an optional data backend is unavailable. |
These paths move object shard bytes or stream potentially large disk data and are the only reasonable first candidates for a pluggable transport.
| Priority | Path | Current client | Current server | Current transport | Why it matters |
|---|---|---|---|---|---|
| P0 | read_file_stream | RemoteDisk::read_file_stream | handle_read_file in http_service.rs | HTTP GET /rustfs/rpc/read_file_stream with a streaming response body | Main remote disk read stream used by bitrot readers and erasure reads. |
| P0 | put_file_stream | RemoteDisk::create_file and RemoteDisk::append_file | handle_put_file in http_service.rs | HTTP PUT /rustfs/rpc/put_file_stream with a streaming request body | Main remote disk write stream used by bitrot writers and erasure writes. |
| P1 | walk_dir | RemoteDisk::walk_dir | handle_walk_dir in http_service.rs | HTTP GET /rustfs/rpc/walk_dir with a streamed metadata listing | Can be high-volume during scans/healing, but it is metadata-oriented rather than object byte data. |
| P1 | ReadAll / WriteAll | RemoteDisk::read_all / write_all | gRPC unary disk handlers | gRPC unary bytes payload | Moves bytes today, but should be measured before treating it as a high-throughput data path. |
| P2 | proto WriteStream / ReadAt | currently not used | currently returns unimplemented | gRPC streaming definitions exist but are not implemented | Declared proto shape, not a current production path. |

Classification:

- Covered by `InternodeDataTransport`: `RemoteDisk` opens the transfer through the transport abstraction.

| Path | Owner references | Server references | Classification | Notes |
|---|---|---|---|---|
| Remote shard read stream | crates/ecstore/src/rpc/remote_disk.rs::RemoteDisk::read_file_stream; crates/ecstore/src/rpc/internode_data_transport.rs::InternodeDataTransport::open_read; crates/ecstore/src/bitrot.rs::create_bitrot_reader | rustfs/src/storage/rpc/http_service.rs::handle_read_file | Covered by InternodeDataTransport | Object GET, repair reads, and erasure decode use this path for remote shard bytes. |
| Remote shard write stream | RemoteDisk::create_file; RemoteDisk::append_file; InternodeDataTransport::open_write; crates/ecstore/src/bitrot.rs::create_bitrot_writer | rustfs/src/storage/rpc/http_service.rs::handle_put_file | Covered by InternodeDataTransport | Object PUT and multipart part upload use this path for remote shard bytes. |
| Remote namespace walk stream | RemoteDisk::walk_dir; InternodeDataTransport::open_walk_dir; crates/ecstore/src/cache_value/metacache_set.rs walk producers | rustfs/src/storage/rpc/http_service.rs::handle_walk_dir | Covered by InternodeDataTransport | High-volume listing/scanner/heal metadata stream. It is not object byte data, but it is a large internode stream. |
| Remote zero-copy read fallback | RemoteDisk::read_file_zero_copy | same as remote shard read stream | Covered by InternodeDataTransport through read_file_stream | The remote path buffers the stream into Bytes; true zero-copy is not guaranteed for remote disks. |

Paths still on direct gRPC:

| Path | Owner references | Server references | Classification | Notes |
|---|---|---|---|---|
| ReadAll | RemoteDisk::read_all; crates/ecstore/src/store_init.rs; heal resume metadata readers | rustfs/src/storage/rpc/disk.rs::handle_read_all | Still direct gRPC | Unary bytes response. Currently used mostly for metadata/config files; measure before moving. |
| WriteAll | RemoteDisk::write_all; crates/ecstore/src/store_init.rs; heal resume metadata writers | rustfs/src/storage/rpc/disk.rs::handle_write_all | Still direct gRPC | Unary bytes request. Currently used mostly for metadata/config/checkpoint writes. |
| ReadMultiple | RemoteDisk::read_multiple; crates/ecstore/src/set_disk/read.rs::read_multiple_files | rustfs/src/storage/rpc/disk.rs::handle_read_multiple | Still direct gRPC | Returns multiple small file payloads, usually metadata/listing support. Could become large with many entries. |
| ReadParts | RemoteDisk::read_parts; crates/ecstore/src/set_disk/read.rs::read_parts; multipart list/complete paths | rustfs/src/storage/rpc/disk.rs::handle_read_parts | Still direct gRPC | Encoded ObjectPartInfo metadata, not object data. |
| RenamePart | RemoteDisk::rename_part; crates/ecstore/src/set_disk/write.rs::rename_part | rustfs/src/storage/rpc/disk.rs::handle_rename_part | Still direct gRPC | Carries part metadata while committing multipart data already written through stream writers. |
| ListDir | RemoteDisk::list_dir; multipart/lifecycle metadata listing callers | rustfs/src/storage/rpc/disk.rs::handle_list_dir | Still direct gRPC | Directory name listing, metadata/control-plane unless measured otherwise. |
| Legacy gRPC WalkDir | rustfs/src/storage/rpc/node_service.rs::NodeService::walk_dir | same file | Still direct gRPC | Server implementation remains, but current RemoteDisk::walk_dir uses HTTP through the transport. Keep until callers are audited or compatibility policy is set. |

Metadata and control-plane areas:

| Area | Owner references | Classification | Notes |
|---|---|---|---|
| Disk metadata and namespace mutations | RemoteDisk::{read_metadata,write_metadata,update_metadata,read_version,read_xl,rename_data,rename_file,delete*,verify_file,check_parts,disk_info} | Metadata/control-plane only | These remain on gRPC by design. |
| Peer/bucket/admin operations | crates/ecstore/src/rpc/{peer_s3_client.rs,peer_rest_client.rs,remote_locker.rs} and matching rustfs/src/storage/rpc/* handlers | Metadata/control-plane only | Not candidates for a data-plane backend without separate measurements. |
| Store init and format operations | crates/ecstore/src/store_init.rs | Metadata/control-plane only | Uses ReadAll/WriteAll for small format/config objects. |
| Heal orchestration | crates/heal/src/heal/storage.rs and crates/ecstore/src/set_disk.rs::heal_object | Metadata/control-plane plus covered data reads | Heal object data reads go through get_object_reader and then covered shard streams; resume/checkpoint metadata uses direct gRPC disk metadata calls. |

Paths not relevant to the data-plane adapter:

| Path | Owner references | Classification | Notes |
|---|---|---|---|
| Proto Write | crates/protos/src/node.proto; rustfs/src/storage/rpc/disk.rs::handle_write | Not relevant | Handler is unimplemented. |
| Proto WriteStream | crates/protos/src/node.proto; rustfs/src/storage/rpc/node_service.rs::write_stream | Not relevant | Returns unimplemented. |
| Proto ReadAt | crates/protos/src/node.proto; rustfs/src/storage/rpc/node_service.rs::read_at | Not relevant | Returns unimplemented. |
| E2E reliant gRPC helpers | crates/e2e_test/src/reliant/* | Not relevant | Test harnesses, not production internode data-path callers. |

Known limitations:

| Risk | Limitation |
|---|---|
| Medium | ReadAll and WriteAll still carry unary bytes over gRPC. They appear metadata-oriented today, but there is no size threshold or routing policy. |
| Medium | ReadMultiple can aggregate many metadata files into one gRPC response. |
| Low | Legacy gRPC WalkDir remains implemented while RemoteDisk::walk_dir uses HTTP through the transport. |
| Medium | Remote read_file_zero_copy is a buffered read over the transport, not a remote zero-copy contract. |
| Medium | Server-side TCP HTTP route handling is outside the client-side trait. |

For object PUTs in distributed erasure mode, the relevant flow is:

1. `SetDisks` selects local and remote disks.
2. `create_bitrot_writer` calls `disk.create_file(...)` for each shard writer.
3. `RemoteDisk::create_file` delegates to `InternodeDataTransport::open_write`.
4. `HttpWriter` sends an HTTP PUT to `/rustfs/rpc/put_file_stream`.
5. `handle_put_file` opens the local file writer and copies incoming body chunks into it.
6. `Erasure::encode` writes shards through `MultiWriter` to all selected writers while enforcing write quorum.

This is the primary write data-plane candidate.
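
The write-quorum idea in the last step of the PUT flow can be illustrated with a small synchronous sketch. It is not the real `MultiWriter` or `Erasure::encode`: `QuorumWriter`, `FailingWriter`, `demo`, and the 4-writer/quorum-of-3 numbers are all assumptions made for illustration, and the real path is asynchronous.

```rust
use std::io::{self, Write};

/// Hypothetical writer that always fails, standing in for an offline remote disk.
struct FailingWriter;

impl Write for FailingWriter {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "remote disk offline"))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Simplified stand-in for a multi-shard writer; not the real `MultiWriter`.
struct QuorumWriter {
    writers: Vec<Option<Box<dyn Write>>>,
    write_quorum: usize,
}

impl QuorumWriter {
    /// Writes shard `i` to writer `i`. A writer that fails is dropped from
    /// later rounds; the call fails when fewer than `write_quorum` remain.
    fn write_shards(&mut self, shards: &[Vec<u8>]) -> io::Result<()> {
        for (slot, shard) in self.writers.iter_mut().zip(shards) {
            let failed = match slot.as_mut() {
                Some(writer) => writer.write_all(shard).is_err(),
                None => false,
            };
            if failed {
                *slot = None;
            }
        }
        let healthy = self.writers.iter().filter(|w| w.is_some()).count();
        if healthy < self.write_quorum {
            let msg = format!("write quorum not met: {healthy} < {}", self.write_quorum);
            return Err(io::Error::new(io::ErrorKind::Other, msg));
        }
        Ok(())
    }
}

/// Builds `total` writers, the first `failing` of which always fail.
fn demo(total: usize, failing: usize, write_quorum: usize) -> io::Result<()> {
    let mut writers: Vec<Option<Box<dyn Write>>> = Vec::new();
    for i in 0..total {
        let writer: Box<dyn Write> = if i < failing {
            Box::new(FailingWriter)
        } else {
            Box::new(Vec::<u8>::new())
        };
        writers.push(Some(writer));
    }
    let shards = vec![vec![0u8; 8]; total];
    QuorumWriter { writers, write_quorum }.write_shards(&shards)
}

fn main() {
    println!("one failed writer:  {:?}", demo(4, 1, 3).is_ok());
    println!("two failed writers: {:?}", demo(4, 2, 3).is_ok());
}
```

The point for the adapter boundary is that quorum accounting sits above the individual shard writers, so a transport backend only has to report per-stream success or failure.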

For object GETs and repair reads in distributed erasure mode, the relevant flow is:

1. `SetDisks` prepares shard readers for the selected disks.
2. `create_bitrot_reader` uses local zero-copy only when `disk.is_local()`.
3. Remote disks go through `disk.read_file_stream(...)`.
4. `RemoteDisk::read_file_stream` delegates to `InternodeDataTransport::open_read`.
5. `HttpReader` sends an HTTP GET to `/rustfs/rpc/read_file_stream`.
6. `handle_read_file` opens the local disk stream and returns it as an HTTP streaming body.

This is the primary read data-plane candidate.

RustFS already has coarse internode metrics in `crates/io-metrics/src/internode_metrics.rs`.

These metrics are useful as a starting point. For backend comparisons, the relevant route-level and operation-level dimensions are:

- `read_file_stream`
- `put_file_stream`
- `walk_dir`
- `ReadAll` / `WriteAll`

Existing benchmark assets:

- `scripts/run_object_batch_bench.sh`
- `scripts/run_object_batch_bench_enhanced.sh`
- `scripts/run_object_batch_bench_abc.sh`
- `scripts/run_four_node_cluster_failover_bench.sh`
- `scripts/run_internode_transport_baseline.sh` (scenario matrix wrapper for local vs distributed TCP baseline artifacts)
- `crates/ecstore/benches/`

These mostly cover S3/object workload or erasure coding performance. They do not yet isolate internode transport cost.

Before changing internode data transport behavior or comparing a non-default backend, collect a baseline for the current TCP/HTTP/gRPC implementation.
Minimum:
Preferred:
| Workload | Sizes | Concurrency | Main signal |
|---|---|---|---|
| S3 PUT | 4 KiB, 1 MiB, 16 MiB, 128 MiB, 1 GiB | 1, 16, 64, 128 | End-to-end write throughput and tail latency. |
| S3 GET | 4 KiB, 1 MiB, 16 MiB, 128 MiB, 1 GiB | 1, 16, 64, 128 | End-to-end read throughput and tail latency. |
| Remote disk stream read | shard-sized ranges from read_file_stream | 1, 16, 64 | Isolated internode read path. |
| Remote disk stream write | shard-sized writes through put_file_stream | 1, 16, 64 | Isolated internode write path. |
| Healing / repair | missing disk or missing shard scenario | controlled | Rebuild throughput and read/write amplification. |
| Scanner walk | large bucket/object namespace | controlled | Metadata streaming pressure, not the primary object-byte transport path. |
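
To make the size of the S3 part of this matrix concrete, the sketch below expands workloads, sizes, and concurrency levels into individual benchmark cells. It is illustrative only: `parse_size` and `build_matrix` are hypothetical helpers and do not describe how `scripts/run_internode_transport_baseline.sh` parses its arguments.

```rust
/// Parses sizes written like "4KiB", "1 MiB", or "1GiB" into bytes.
/// Returns None for unknown units or malformed input.
fn parse_size(text: &str) -> Option<u64> {
    let compact: String = text.chars().filter(|c| !c.is_whitespace()).collect();
    let split = compact.find(|c: char| !c.is_ascii_digit())?;
    let (digits, unit) = compact.split_at(split);
    let value: u64 = digits.parse().ok()?;
    let factor: u64 = match unit {
        "B" => 1,
        "KiB" => 1 << 10,
        "MiB" => 1 << 20,
        "GiB" => 1 << 30,
        _ => return None,
    };
    value.checked_mul(factor)
}

/// Expands the matrix into (workload, object size in bytes, concurrency) cells.
fn build_matrix(
    workloads: &[&str],
    sizes: &[&str],
    concurrencies: &[u32],
) -> Option<Vec<(String, u64, u32)>> {
    let mut cells = Vec::new();
    for workload in workloads {
        for size in sizes {
            let bytes = parse_size(size)?;
            for &concurrency in concurrencies {
                cells.push(((*workload).to_string(), bytes, concurrency));
            }
        }
    }
    Some(cells)
}

fn main() {
    let sizes = ["4 KiB", "1 MiB", "16 MiB", "128 MiB", "1 GiB"];
    let cells = build_matrix(&["S3 PUT", "S3 GET"], &sizes, &[1, 16, 64, 128])
        .expect("all sizes in the table parse");
    println!("{} cells", cells.len());
    for (workload, bytes, concurrency) in cells.iter().take(3) {
        println!("{workload},{bytes},{concurrency}");
    }
}
```

With the two S3 rows as written, five sizes and four concurrency levels give 40 cells per scenario, which is worth keeping in mind when choosing a per-cell duration.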

Collect:

- `rustfs_system_network_internode_*` metrics

The baseline should produce a machine-readable artifact, for example `target/bench/internode-transport/<timestamp>/summary.csv`, plus the exact commands and configuration used.

Use `scripts/run_internode_transport_baseline.sh` to execute a reproducible S3 PUT/GET matrix against local and distributed scenarios and export:

- `summary.csv` (throughput/latency summary per workload and object size)
- `internode_metric_deltas.csv` (operation-level internode metric deltas when `--metrics-url` is provided)

See `transport-metrics-and-baseline.md` for current metric names, labels, operation values, baseline inputs, and baseline artifact fields.
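
A metric delta here means the difference between two scrapes of the metrics endpoint taken before and after a run. The sketch below shows one way to compute it from Prometheus text exposition output. It is an assumption-laden illustration: the series name `rustfs_system_network_internode_sent_bytes_total`, the `operation` label, and the sample values are hypothetical (see `transport-metrics-and-baseline.md` for the real names), and the parser ignores optional timestamps.

```rust
use std::collections::BTreeMap;

/// Prefix shared by the internode metrics named in this document.
const PREFIX: &str = "rustfs_system_network_internode_";

/// Hypothetical scrape taken before a benchmark run.
const BEFORE: &str = "# TYPE rustfs_system_network_internode_sent_bytes_total counter\n\
rustfs_system_network_internode_sent_bytes_total{operation=\"read_file_stream\"} 100\n\
process_cpu_seconds_total 12.5\n";

/// Hypothetical scrape taken after the run.
const AFTER: &str = "rustfs_system_network_internode_sent_bytes_total{operation=\"read_file_stream\"} 350\n\
rustfs_system_network_internode_sent_bytes_total{operation=\"put_file_stream\"} 64\n";

/// Parses text exposition lines into series -> value, keeping only
/// series whose name starts with `PREFIX`.
fn parse_scrape(text: &str) -> BTreeMap<String, f64> {
    let mut out = BTreeMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') || !line.starts_with(PREFIX) {
            continue;
        }
        // The sample value follows the last space on the line.
        if let Some((series, value)) = line.rsplit_once(' ') {
            if let Ok(v) = value.parse::<f64>() {
                out.insert(series.trim().to_string(), v);
            }
        }
    }
    out
}

/// Computes after - before per series; series absent before count from zero.
fn deltas(before: &BTreeMap<String, f64>, after: &BTreeMap<String, f64>) -> BTreeMap<String, f64> {
    after
        .iter()
        .map(|(k, v)| (k.clone(), v - before.get(k).copied().unwrap_or(0.0)))
        .collect()
}

fn main() {
    let result = deltas(&parse_scrape(BEFORE), &parse_scrape(AFTER));
    for (series, delta) in &result {
        println!("{series},{delta}");
    }
}
```

Counter resets between scrapes would produce negative deltas in this sketch, so a real collector would need to detect node restarts during a run.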
Keep NodeService as the control plane. Introduce a separate data transport
only below RemoteDisk, where remote disk byte streams are opened today.
The first implementation should be a no-behavior-change TCP/HTTP backend that
wraps the current HttpReader, HttpWriter, and /rustfs/rpc/* handlers.
Non-default backend work should not proceed until the default wrapper is
measured and adapter gaps are documented.
The current boundary is remote disk stream transfer:

```rust
#[async_trait::async_trait]
pub trait InternodeDataTransport: Send + Sync + std::fmt::Debug {
    async fn open_read(&self, request: ReadStreamRequest) -> Result<FileReader>;
    async fn open_write(&self, request: WriteStreamRequest) -> Result<FileWriter>;
    async fn open_walk_dir(&self, request: WalkDirStreamRequest) -> Result<FileReader>;
    fn name(&self) -> &'static str;
    fn capabilities(&self) -> InternodeDataTransportCapabilities;
}
```

Initial request fields should mirror the current HTTP query parameters:
The initial TCP backend can keep the current signed HTTP URLs internally.

`RemoteDisk` delegates only these methods to the data transport:

- `read_file_stream`
- `read_file_zero_copy` as the current wrapper over `read_file_stream`
- `append_file`
- `create_file`
- `walk_dir`

All other `RemoteDisk` methods continue using the current gRPC client in this adapter scope.
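
The delegation shape can be sketched with a simplified, synchronous stand-in for the trait. Everything in this block is illustrative: `DataTransport`, `InMemoryTransport`, and this `RemoteDisk` are reduced to `std` types so the example runs on its own, whereas the real trait is async and takes request structs.

```rust
use std::io::{self, Cursor, Read};
use std::sync::Arc;

/// Simplified, synchronous stand-in for `InternodeDataTransport`.
trait DataTransport: Send + Sync {
    fn open_read(&self, path: &str) -> io::Result<Box<dyn Read>>;
    fn name(&self) -> &'static str;
}

/// Hypothetical in-memory backend used only by this sketch.
struct InMemoryTransport {
    payload: Vec<u8>,
}

impl DataTransport for InMemoryTransport {
    fn open_read(&self, _path: &str) -> io::Result<Box<dyn Read>> {
        Ok(Box::new(Cursor::new(self.payload.clone())))
    }
    fn name(&self) -> &'static str {
        "in-memory-sketch"
    }
}

/// Reduced `RemoteDisk`: it holds a transport and opens streams through it.
struct RemoteDisk {
    transport: Arc<dyn DataTransport>,
}

impl RemoteDisk {
    /// Delegates the stream open to whichever backend was selected.
    fn read_file_stream(&self, path: &str) -> io::Result<Box<dyn Read>> {
        self.transport.open_read(path)
    }

    /// Wrapper over `read_file_stream`: the whole stream is buffered in
    /// memory, so no zero-copy guarantee is made for remote disks.
    fn read_file_zero_copy(&self, path: &str) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.read_file_stream(path)?.read_to_end(&mut buf)?;
        Ok(buf)
    }
}

fn main() -> io::Result<()> {
    let disk = RemoteDisk {
        transport: Arc::new(InMemoryTransport { payload: b"shard bytes".to_vec() }),
    };
    let bytes = disk.read_file_zero_copy("bucket/object/part.1")?;
    println!("{} read {} bytes", disk.transport.name(), bytes.len());
    Ok(())
}
```

Because callers only see `RemoteDisk` methods, the bitrot readers and erasure code above this layer do not need to know which backend opened the stream.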
Avoid hard-coding transport-specific assumptions into the generic interface. The current conservative capability fields are:
The TCP/HTTP backend should report only capabilities that it actually provides.
TCP/HTTP/gRPC must remain the default and required backend.

Fallback rules:

- The accepted values for `RUSTFS_INTERNODE_DATA_TRANSPORT` are `tcp-http` and the `tcp` alias. Empty and unset values use `tcp-http`.

Do not add fallback settings until there is an implementation PR that uses them.
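
The selection rule can be sketched as a small parser. This is a minimal sketch under stated assumptions, not the code in `internode_data_transport.rs`: the `Backend` enum and `parse_backend` function are hypothetical names, values are matched exactly after trimming whitespace, and case handling in the real implementation is not described in this document.

```rust
/// Backends known to this sketch; only the default exists in OSS scope.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    TcpHttp,
}

/// Hypothetical parser for the value of `RUSTFS_INTERNODE_DATA_TRANSPORT`.
/// Unset and blank values select the default; unknown values are an error
/// rather than a silent fallback, which is the fail-closed behavior.
fn parse_backend(raw: Option<&str>) -> Result<Backend, String> {
    match raw.map(str::trim).unwrap_or("") {
        "" | "tcp-http" | "tcp" => Ok(Backend::TcpHttp),
        other => Err(format!(
            "unsupported RUSTFS_INTERNODE_DATA_TRANSPORT value: {other:?}"
        )),
    }
}

fn main() {
    let raw = std::env::var("RUSTFS_INTERNODE_DATA_TRANSPORT").ok();
    match parse_backend(raw.as_deref()) {
        Ok(backend) => println!("selected backend: {backend:?}"),
        Err(err) => {
            eprintln!("{err}");
            std::process::exit(1);
        }
    }
}
```

Returning an error for unknown values keeps a typo in the environment from silently running a different transport than the operator intended.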

Dry-run command:

```bash
scripts/run_internode_transport_baseline.sh \
  --access-key minioadmin \
  --secret-key minioadmin \
  --scenarios local=http://127.0.0.1:9000,distributed=http://127.0.0.1:9001 \
  --sizes 4KiB,1MiB \
  --concurrencies 1 \
  --duration 10s \
  --dry-run
```

Real TCP baseline command with metrics:

```bash
RUSTFS_INTERNODE_DATA_TRANSPORT=tcp-http \
scripts/run_internode_transport_baseline.sh \
  --access-key "$RUSTFS_ACCESS_KEY" \
  --secret-key "$RUSTFS_SECRET_KEY" \
  --scenarios local=http://127.0.0.1:9000,distributed=http://127.0.0.1:9001 \
  --metrics-url http://127.0.0.1:9000/metrics \
  --out-dir target/bench/internode-transport/manual-run
```

Expected artifacts:

- `run_manifest.txt`
- `summary.csv`
- `internode_metric_deltas.csv` when `--metrics-url` is provided

The baseline validates the default TCP/HTTP path only. It must not be used to claim support or performance for any other transport.

The current adapter boundary has these constraints:

- `tcp-http` is the default and only OSS backend.

The current `RemoteDisk::walk_dir` stream is routed through the adapter. Metadata RPCs, locks, admin RPCs, bucket coordination, and the legacy gRPC `WalkDir` handler remain outside the current data-plane boundary.

This RFC does not add a plugin system, split the adapter into a separate crate, add accepted backend values, or implement a new transport backend.