Internode Buffer Lifecycle and Copy Count

crates/ecstore/docs/internode-transport/transport-buffer-lifecycle.md


Status: P1-D analysis only. This document records the current TCP/HTTP internode data path and the ownership boundaries that matter for the backend-neutral InternodeDataTransport adapter. It does not implement a new backend or change production behavior.

Open-source Scope

The OSS scope is:

  • define buffer ownership and copy-count behavior for the current InternodeDataTransport adapter;
  • keep tcp-http as the default backend;
  • keep existing TCP/HTTP behavior unchanged;
  • document copy hotspots and ownership gaps for maintainable transport code;
  • avoid adding dependencies or backend implementations.

The OSS scope is not:

  • adding another transport backend;
  • replacing the current TCP/HTTP path;
  • adding benchmark plans for another transport;
  • changing object correctness semantics.

Scope

The covered paths are the large internode data-plane calls currently routed through InternodeDataTransport:

| Path | Entry | Transport owner | Server owner |
| --- | --- | --- | --- |
| Read stream | RemoteDisk::read_file_stream in crates/ecstore/src/rpc/remote_disk.rs | TcpHttpInternodeDataTransport::open_read in crates/ecstore/src/rpc/internode_data_transport.rs, HttpReader in crates/rio/src/http_reader.rs | handle_read_file in rustfs/src/storage/rpc/http_service.rs |
| Write stream | RemoteDisk::create_file, RemoteDisk::append_file in crates/ecstore/src/rpc/remote_disk.rs | TcpHttpInternodeDataTransport::open_write in crates/ecstore/src/rpc/internode_data_transport.rs, HttpWriter in crates/rio/src/http_reader.rs | handle_put_file in rustfs/src/storage/rpc/http_service.rs |
| Walk dir stream | RemoteDisk::walk_dir in crates/ecstore/src/rpc/remote_disk.rs | TcpHttpInternodeDataTransport::open_walk_dir, HttpReader | handle_walk_dir in rustfs/src/storage/rpc/http_service.rs |

Object read/write/heal callers enter these streams through create_bitrot_reader and create_bitrot_writer in crates/ecstore/src/bitrot.rs. Erasure decode and encode then move data through ParallelReader in crates/ecstore/src/erasure_coding/decode.rs and MultiWriter in crates/ecstore/src/erasure_coding/encode.rs.

Read Stream

| Step | Owner | Buffer type | Copy? | Reason |
| --- | --- | --- | --- | --- |
| Build request | RemoteDisk::read_file_stream | String fields in ReadStreamRequest | Yes | Volume, path, endpoint, and disk references are copied into an owned request before async transport dispatch. This is metadata, not payload. |
| Select transport | TcpHttpInternodeDataTransport::open_read | URL String, HeaderMap | Yes | URL and auth headers are HTTP control data. No object bytes are copied here. |
| Open local file on server | handle_read_file, LocalDisk::read_file_stream | FileCacheReclaimReader boxed as FileReader | No payload copy | The server owns an async file reader positioned at the requested offset. |
| File to HTTP body | read_file_body_stream | ReaderStream<AsyncRead> yielding Bytes | Yes | ReaderStream::with_capacity reads from the file into chunk buffers. This is the file-to-network buffer materialization point. |
| Length limiting | rustfs_utils::net::bytes_stream | Bytes | Usually no | Bytes::truncate adjusts the chunk view when the last chunk exceeds the requested length. It does not copy the retained prefix. |
| HTTP receive | HttpReader::with_capacity_and_stall_timeout | reqwest::Response::bytes_stream() yielding Bytes | Network stack dependent | The user-level object is Bytes; any kernel/TLS/hyper copy is below the current RustFS abstraction. |
| Stream to caller buffer | HttpReader::poll_read | StreamReader<Stream<Item = Bytes>>, caller ReadBuf | Yes | StreamReader exposes AsyncRead, so it copies bytes from each Bytes chunk into the caller-provided ReadBuf. |
| Bitrot verification | BitrotReader::read | caller &mut [u8], hash_buf: Vec<u8> | No additional payload copy | The bitrot reader reads hash bytes into hash_buf and payload bytes directly into the supplied output slice. Hash calculation reads the slice. |
| Erasure shard read | ParallelReader::read | Vec<u8> per shard | Yes | Each shard read allocates vec![0u8; shard_size]; data is filled there before decode/reconstruction. |
| Object response write | write_data_blocks | slices of shard Vec<u8> | No extra staging copy | Decoded data block slices are written to the target writer with write_all; the target may copy internally. |
| Remote zero-copy helper | RemoteDisk::read_file_zero_copy | Vec<u8> then Bytes | Yes | The remote implementation reads the full stream into a Vec and converts it into Bytes. It is a convenience fallback, not network zero-copy. |
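
The two client-visible effects in the table above, length limiting without a copy and the copy into the caller buffer, can be sketched with the standard library alone. This is a minimal illustration, not the RustFS code: Vec<u8> stands in for a Bytes chunk, std::io::Read stands in for AsyncRead, and ChunkReader, limit_chunks, and the copied counter are hypothetical names introduced here.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical std-only stand-in for a reader over a stream of owned chunks.
/// `Vec<u8>` models one `Bytes` chunk; `copied` counts payload bytes that were
/// copied into caller-provided buffers.
pub struct ChunkReader {
    chunks: VecDeque<Vec<u8>>,
    offset: usize, // read position inside the front chunk
    pub copied: usize,
}

impl ChunkReader {
    pub fn new(chunks: Vec<Vec<u8>>) -> Self {
        Self { chunks: chunks.into(), offset: 0, copied: 0 }
    }
}

/// Length limiting: shorten the chunk that crosses `limit` and drop the rest.
/// Truncation only shortens the chunk; the retained prefix is not copied.
pub fn limit_chunks(mut chunks: Vec<Vec<u8>>, limit: usize) -> Vec<Vec<u8>> {
    let mut remaining = limit;
    for chunk in chunks.iter_mut() {
        if chunk.len() > remaining {
            chunk.truncate(remaining);
        }
        remaining -= chunk.len();
    }
    chunks.retain(|c| !c.is_empty());
    chunks
}

impl Read for ChunkReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Drop chunks that have been fully consumed.
        while let Some(front) = self.chunks.front() {
            if self.offset < front.len() {
                break;
            }
            self.chunks.pop_front();
            self.offset = 0;
        }
        let Some(front) = self.chunks.front() else {
            return Ok(0); // end of stream
        };
        // This is the payload copy: chunk bytes into the caller's buffer.
        let n = buf.len().min(front.len() - self.offset);
        buf[..n].copy_from_slice(&front[self.offset..self.offset + n]);
        self.offset += n;
        self.copied += n;
        Ok(n)
    }
}
```

Every byte returned to the caller passes through copy_from_slice exactly once in this sketch, which is why a read-style interface over owned chunks cannot avoid that copy, while the length limit costs nothing per retained byte.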

Write Stream

| Step | Owner | Buffer type | Copy? | Reason |
| --- | --- | --- | --- | --- |
| Build writer request | RemoteDisk::create_file, RemoteDisk::append_file | String fields in WriteStreamRequest | Yes | Volume, path, endpoint, and disk references are copied into an owned request. This is metadata. |
| Select transport | TcpHttpInternodeDataTransport::open_write | URL String, HeaderMap | Yes | URL and auth headers are HTTP control data. No object bytes are copied here. |
| Erasure encode input | Erasure::encode in encode.rs | reusable Vec<u8> sized to block_size | Yes | rustfs_utils::read_full fills a block buffer from the source reader before encoding. |
| Erasure encode output | Erasure::encode_data caller in encode.rs | Vec<Bytes> per encoded block | Yes | Encoding creates shard Bytes for data and parity blocks before queueing them to writers. |
| Multi-writer fanout | MultiWriter::write | borrowed Bytes shards | No additional fanout copy | The writer fanout passes borrowed Bytes references to each BitrotWriterWrapper. |
| Bitrot write | BitrotWriter::write | shard &[u8], checksum bytes | Yes for checksum bytes | The payload slice is passed to the inner writer, while checksum bytes are generated and written before payload when enabled. |
| Client HTTP writer buffer | HttpWriter::poll_write and poll_write_vectored | BytesMut pending chunk or Bytes::copy_from_slice | Yes | Small writes are coalesced with BytesMut::extend_from_slice; large single writes still copy into an owned Bytes because the async body must outlive the caller's borrowed buffer. |
| Client channel to reqwest | HttpWriter::poll_send_pending_chunk, ReceiverStream | Bytes | No | BytesMut::split().freeze() transfers owned chunk storage to Bytes; the mpsc channel and stream move the Bytes handle. |
| HTTP receive body on server | handle_put_file | Incoming::into_data_stream() yielding Bytes | Network stack dependent | The server receives owned Bytes chunks from hyper. |
| Server body coalescing | write_body_chunks_to_writer | BytesMut sized to DEFAULT_READ_BUFFER_SIZE | Yes | Each incoming Bytes chunk is copied into pending before writing to the local file writer. This normalizes chunk size but adds a full payload copy. |
| Local file write | LocalDisk::create_file, LocalDisk::append_file, FileCacheReclaimWriter | &[u8] into tokio::fs::File | Kernel dependent | RustFS passes slices to Tokio file writes. Kernel page-cache copies are below the RustFS abstraction. |
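
The client-side rows above separate two operations: copying the caller's borrowed slice into owned storage, and moving that owned storage through a channel. A rough std-only sketch of the distinction follows. It is not the actual HttpWriter logic: ChannelWriter, CHUNK_SIZE, and the copied counter are hypothetical, Vec<u8> stands in for BytesMut/Bytes, and std::sync::mpsc stands in for the async channel.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical coalescing threshold; the real chunk size is not stated here.
const CHUNK_SIZE: usize = 8;

fn closed<T>(_: T) -> io::Error {
    io::Error::new(io::ErrorKind::BrokenPipe, "receiver dropped")
}

/// Hypothetical writer that must own every chunk it hands to the channel.
pub struct ChannelWriter {
    pending: Vec<u8>,    // stands in for the pending BytesMut chunk
    tx: Sender<Vec<u8>>, // stands in for the channel feeding the request body
    pub copied: usize,   // payload bytes copied out of borrowed caller buffers
}

impl ChannelWriter {
    pub fn new() -> (Self, Receiver<Vec<u8>>) {
        let (tx, rx) = channel();
        let writer = Self { pending: Vec::with_capacity(CHUNK_SIZE), tx, copied: 0 };
        (writer, rx)
    }

    /// Moves the pending buffer into the channel. Only the Vec handle moves;
    /// no payload bytes are copied at this step.
    fn send_pending(&mut self) -> io::Result<()> {
        if self.pending.is_empty() {
            return Ok(());
        }
        let chunk = std::mem::replace(&mut self.pending, Vec::with_capacity(CHUNK_SIZE));
        self.tx.send(chunk).map_err(closed)
    }
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() >= CHUNK_SIZE {
            // Large write: keep ordering, then copy the borrowed slice into an
            // owned chunk, because the receiver may outlive `buf`.
            self.send_pending()?;
            self.tx.send(buf.to_vec()).map_err(closed)?;
        } else {
            // Small write: coalesce into the pending chunk (also a copy).
            if self.pending.len() + buf.len() > CHUNK_SIZE {
                self.send_pending()?;
            }
            self.pending.extend_from_slice(buf);
        }
        self.copied += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.send_pending()
    }
}
```

In this sketch both branches of write copy the payload once, and send_pending copies nothing, which mirrors the Yes and No entries for the two client rows.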

Request and Serialization Boundaries

| Boundary | Owner | Buffer type | Copy? | Notes |
| --- | --- | --- | --- | --- |
| Read/write query parameters | build_read_file_stream_url, build_put_file_stream_url | URL-encoded String | Yes | Metadata only. It includes disk, volume, path, offset, length, append, and size. |
| Auth headers | build_auth_headers callers | HeaderMap | Yes | Metadata only. This is currently tied to HTTP request construction. |
| Walk dir request | RemoteDisk::walk_dir, open_walk_dir, handle_walk_dir | JSON Vec<u8> body, collected Bytes on server | Yes | Walk dir is a streamed response but its request body is serialized JSON control data. |
| gRPC read/write-all | RemoteDisk::read_all, RemoteDisk::write_all, NodeService::{handle_read_all,handle_write_all} | Prost Bytes/message bodies | Yes | These paths are still gRPC byte paths, not InternodeDataTransport; they matter for metrics and inventory but are outside this P1-D stream-copy count. |

Hotspots

| Rank | Hotspot | Impact | Reason |
| --- | --- | --- | --- |
| 1 | HttpWriter::poll_write and poll_write_vectored | High on write path | Every borrowed caller buffer is copied into owned BytesMut or Bytes before it can be sent by an async HTTP body. |
| 2 | write_body_chunks_to_writer | High on write path | The server copies every received Bytes chunk into a coalescing BytesMut before local disk write. |
| 3 | ParallelReader::read shard buffers | High on read path | Each shard read allocates and fills a Vec<u8> before decode can proceed. This is also where degraded reads wait on quorum. |
| 4 | ReaderStream::with_capacity plus StreamReader | Medium on read path | Server file reads create Bytes chunks, then client AsyncRead copies those chunks into the caller's ReadBuf. |
| 5 | Erasure::encode block and shard materialization | Medium on write path | Source data is first read into a block Vec<u8>, then encoded into per-shard Bytes. This is necessary for the current erasure API. |
| 6 | RemoteDisk::read_file_zero_copy | Medium when used | Remote zero-copy reads buffer the whole stream into memory. The name does not mean zero-copy over the network. |
| 7 | URL/query/header/JSON serialization | Low | Metadata copies are small and not on the large payload hot path. |
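
The trade-off behind hotspot 2 can be made concrete with a small sketch: coalescing normalizes the size of each write at the cost of copying every payload byte once, while passing chunks through keeps network-sized writes and adds no staging copy. The code below is a std-only illustration, not the server implementation. write_coalesced, write_passthrough, and SizeLog are hypothetical names, Vec<u8> stands in for Bytes, and buf_size stands in for DEFAULT_READ_BUFFER_SIZE.

```rust
use std::io::{self, Write};

/// Hypothetical writer that records the size of every write it receives.
#[derive(Default)]
pub struct SizeLog(pub Vec<usize>);

impl Write for SizeLog {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.push(buf.len());
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Copies every incoming chunk into a staging buffer and writes `buf_size`
/// bytes at a time. Returns the number of payload bytes copied while staging.
pub fn write_coalesced<W: Write>(
    chunks: &[Vec<u8>],
    buf_size: usize,
    out: &mut W,
) -> io::Result<usize> {
    assert!(buf_size > 0, "staging buffer must hold at least one byte");
    let mut pending: Vec<u8> = Vec::with_capacity(buf_size);
    let mut copied = 0;
    for chunk in chunks {
        let mut rest = chunk.as_slice();
        while !rest.is_empty() {
            let take = (buf_size - pending.len()).min(rest.len());
            pending.extend_from_slice(&rest[..take]); // the staging copy
            copied += take;
            rest = &rest[take..];
            if pending.len() == buf_size {
                out.write_all(&pending)?;
                pending.clear();
            }
        }
    }
    if !pending.is_empty() {
        out.write_all(&pending)?; // final partial buffer
    }
    Ok(copied)
}

/// Writes each chunk as received: no staging copy, but the writer sees
/// whatever chunk sizes the network produced. Returns 0 staged bytes.
pub fn write_passthrough<W: Write>(chunks: &[Vec<u8>], out: &mut W) -> io::Result<usize> {
    for chunk in chunks {
        out.write_all(chunk)?;
    }
    Ok(0)
}
```

With chunks of 3, 5, and 2 bytes and a 4-byte staging buffer, the coalesced variant issues writes of 4, 4, and 2 bytes and stages all 10 bytes; the pass-through variant issues writes of 3, 5, and 2 bytes and stages none.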

Adapter Ownership Gaps

  1. FileReader and FileWriter are boxed AsyncRead/AsyncWrite trait objects. They expose borrowed buffers per poll, not stable backend-owned regions, transfer handles, or explicit completion ownership.
  2. InternodeDataTransport currently returns stream traits only. Its capabilities advertise that TCP/HTTP does not require backend-specific buffer registration and is not a zero-copy candidate, but there is no backend API to pass stable backend-managed buffers.
  3. HttpWriter must own outgoing chunks because the async request body outlives the caller's borrowed &[u8]. A lower-copy backend would need a different lifetime contract or an owned buffer pool.
  4. Server write handling normalizes all incoming body chunks into a new BytesMut. Avoiding that copy would require passing incoming Bytes or backend-owned receive buffers directly into the disk/bitrot writer contract.
  5. Erasure decode owns shard Vec<u8> buffers and write-back happens through AsyncWrite. A lower-copy backend would need explicit ownership of shard buffers across decode, reconstruction, and network completion.
  6. Erasure encode materializes Vec<Bytes> blocks before fanout. A backend that can send multiple stable slices would need an encode output representation that can be transferred without repacking.
  7. The HTTP auth and URL construction boundary is part of the current TCP/HTTP backend. A non-HTTP backend would need equivalent peer authentication and disk addressing without assuming URL query parameters.
  8. Zero-copy reads exist only for local disks, via read_file_zero_copy. Remote disks deliberately fall back to network streaming and full-buffer collection for the zero-copy helper.
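
Gap 3 mentions an owned buffer pool as one way to change the lifetime contract. The following is a speculative std-only sketch of that idea, not a proposed RustFS API: BufferPool, produce_into, and their methods are hypothetical. The producer fills a pool-owned buffer directly, the buffer is then moved to the sender, and it returns to the pool once the send completes, so no borrowed slice has to be copied into owned storage.

```rust
/// Hypothetical owned buffer pool; not part of RustFS.
pub struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    /// Pre-allocates `count` buffers with `buf_size` bytes of capacity each.
    pub fn new(count: usize, buf_size: usize) -> Self {
        let free = (0..count).map(|_| Vec::with_capacity(buf_size)).collect();
        Self { free, buf_size }
    }

    /// Hands out an empty owned buffer, allocating a new one if the pool is empty.
    pub fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_size))
    }

    /// Takes a buffer back after the transport has finished with it.
    pub fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear(); // keeps the allocation, drops the contents
        self.free.push(buf);
    }

    /// Number of buffers currently available without allocating.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}

/// The producer writes straight into a pool-owned buffer and returns it by
/// value, so the transport can take ownership without a further copy.
pub fn produce_into(pool: &mut BufferPool, fill: impl FnOnce(&mut Vec<u8>)) -> Vec<u8> {
    let mut buf = pool.acquire();
    fill(&mut buf);
    buf
}
```

Moving the returned Vec through a channel keeps the same heap allocation, so the payload written by the producer is the payload the sender sees. The cost is an explicit completion step (release) that the current AsyncRead/AsyncWrite trait objects have no way to express.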