plans/2025-04-27-large-file-read-range-support-v4.md
Implement support for reading extremely large text files by adding range parameters (start_byte and end_byte) to the file read tool, allowing users to read specific portions of large files without loading the entire file into memory. Binary files should not be supported and UTF-8 character boundaries must always be respected.
Update FsReadService interface to support range reading
Files:
crates/forge_services/src/infra.rs
Implement the range reading functionality in ForgeFileReadService
Files:
crates/forge_infra/src/fs_read.rs
Add binary file detection using the infer crate
Files:
crates/forge_fs/src/lib.rs
crates/forge_fs/Cargo.toml
Update ForgeFS to support range reading with binary file validation
Files:
crates/forge_fs/src/lib.rs
Implement UTF-8 boundary detection and correction
Files:
crates/forge_fs/src/lib.rs
Update the FSReadInput struct to include optional range parameters
Files:
crates/forge_services/src/tools/fs/fs_read.rs
Modify FSRead tool implementation to support range reading and reject binary files
Files:
crates/forge_services/src/tools/fs/fs_read.rs
Notes: Update the call method to use the range-based reading with UTF-8 boundary adjustment and ensure binary files are rejected
Update the FSRead tool description
Dependencies: Task 6
Files:
crates/forge_services/src/tools/fs/fs_read.rs
Notes: Update docstring to include range parameters in the tool description and explicitly mention that binary files are not supported and UTF-8 boundaries are always respected
Sample Response:
---
path: /a/b/c.txt
range: 100-200
total: 1024
---
Hello! This is the contents of file c.txt
Status: Not Started
Implement file size detection logic
Files:
crates/forge_fs/src/lib.rs
Add content length information to range read responses
Files:
crates/forge_services/src/tools/fs/fs_read.rs
Add unit tests for range-based file reading and binary file rejection
Files:
crates/forge_services/src/tools/fs/fs_read.rs
crates/forge_infra/src/fs_read.rs
crates/forge_fs/src/lib.rs
Performance issues with extremely large text files
Mitigation:
UTF-8 boundary adjustment overhead
Mitigation:
Breaking changes to the existing API
Mitigation:
Inaccurate binary file detection
Mitigation:
File locking and concurrent access issues
Mitigation:
Memory consumption with large ranges
Mitigation:
Platform-specific issues
Mitigation:
Invalid UTF-8 sequences in text files
Mitigation:
Dependency management issues with infer crate
Mitigation:
Confusion for users with the new metadata in responses
Mitigation:
Streaming API: Implement a streaming interface for file reading instead of range-based reading. This would allow progressive loading of large files but would require more significant changes to the tool interface.
File Pagination Tool: Create a separate tool specifically for paginated file reading, leaving the original file read tool unchanged. This would maintain perfect backward compatibility but introduce redundancy.
Content-Based Partitioning: Implement intelligent partitioning based on content (e.g., by line, by paragraph, by JSON object) rather than raw bytes. This would be more semantic but more complex to implement.
Fixed-size chunking: Instead of arbitrary byte ranges, implement a chunking system where files are divided into fixed-size chunks that can be requested by index. This would simplify the API but reduce flexibility.
Smart text-only file reading: Implement a detection mechanism that automatically determines the optimal portion of a text file to return based on the context of the request, using language-aware boundaries like paragraphs or code blocks.
Custom binary detection instead of infer: Implement our own binary detection logic instead of relying on an external crate. This would reduce dependencies but require more maintenance and could be less accurate.
For the FSReadInput struct, add the following optional parameters:
/// Optional start position in bytes (0-based)
pub start_byte: Option<u64>,
/// Optional end position in bytes (exclusive)
pub end_byte: Option<u64>,
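A minimal sketch of how the two optional fields might be resolved into a concrete range once the file size is known. The `path` field, the `resolve_range` helper, and the choice to clamp out-of-bounds values to the file size are assumptions for illustration; the plan does not specify them.

```rust
/// Sketch of the tool input with the two optional range fields from the plan.
/// The `path` field is assumed here for completeness.
#[derive(Debug, Clone, PartialEq)]
pub struct FSReadInput {
    pub path: String,
    /// Optional start position in bytes (0-based)
    pub start_byte: Option<u64>,
    /// Optional end position in bytes (exclusive)
    pub end_byte: Option<u64>,
}

impl FSReadInput {
    /// Hypothetical helper: a missing start defaults to 0, a missing end
    /// defaults to the file size, and both are clamped to the file size.
    /// It does not check that start <= end.
    pub fn resolve_range(&self, file_size: u64) -> (u64, u64) {
        let start = self.start_byte.unwrap_or(0).min(file_size);
        let end = self.end_byte.unwrap_or(file_size).min(file_size);
        (start, end)
    }
}
```

With both fields left as `None`, this resolves to the whole file, which would keep existing callers working unchanged.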
To detect binary files, we'll use the infer crate:
Add the infer crate to dependencies in Cargo.toml:
[dependencies]
infer = "0.15.0" # Use the latest version
Implement a utility function that:
Uses infer::is_image(), infer::is_video(), infer::is_audio(), infer::is_archive(), etc. to detect binary formats
When a file is detected as binary, return an error message like: "Binary files are not supported. File detected as [file type]. Please use another tool or method to process this file."
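The rejection step could be sketched as below. To keep the example dependency-free, the detection result is passed in as an optional MIME string; in the real code it would presumably come from the infer crate (for example `infer::get(&header).map(|t| t.mime_type())`), which is not exercised here. The `reject_if_binary` name and the decision to let `text/*` signatures through are assumptions.

```rust
/// Hypothetical helper: map a detected MIME type to the plan's error message.
/// `detected_mime` is assumed to come from a signature check on the first
/// bytes of the file; `None` means no known signature matched.
pub fn reject_if_binary(detected_mime: Option<&str>) -> Result<(), String> {
    match detected_mime {
        // Assumption: text-like signatures (e.g. text/html) are still readable.
        Some(mime) if !mime.starts_with("text/") => Err(format!(
            "Binary files are not supported. File detected as {mime}. \
             Please use another tool or method to process this file."
        )),
        // No signature, or a text signature: treat the file as text.
        _ => Ok(()),
    }
}
```

A file with no recognized signature is treated as text here, so the later UTF-8 validation remains the last line of defense against unrecognized binary formats.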
To ensure range reads respect UTF-8 character boundaries:
For the start position:
For the end position:
Report the adjusted positions in the response metadata
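One possible shape for the boundary adjustment, assuming the range is widened outward (start moves back, end moves forward) until both edges sit on a character boundary. That direction is inferred from the sample response in this plan, where 500 becomes 498 and 1500 becomes 1503. The function name is hypothetical.

```rust
/// True when `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Hypothetical helper: widen [start, end) so that neither edge splits a
/// multi-byte character. Assumes start <= end; both are clamped to the
/// buffer length.
pub fn align_to_char_boundaries(bytes: &[u8], start: usize, end: usize) -> (usize, usize) {
    let mut s = start.min(bytes.len());
    let mut e = end.min(bytes.len());
    // Move the start backward while it points into the middle of a character.
    while s > 0 && s < bytes.len() && is_continuation(bytes[s]) {
        s -= 1;
    }
    // Move the end forward while it points into the middle of a character.
    while e < bytes.len() && is_continuation(bytes[e]) {
        e += 1;
    }
    (s, e)
}
```

Because a UTF-8 character is at most four bytes long, each edge moves by at most three bytes on valid input, so a real implementation would only need to inspect a few bytes around each edge rather than hold the whole file in memory.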
The response will include:
The fs_read tool will return JSON with the following structure:
{
"content": "This is the file content within the requested range...",
"metadata": {
"file_size": 1024000,
"requested_range": {
"start_byte": 500,
"end_byte": 1500
},
"actual_range": {
"start_byte": 498,
"end_byte": 1503
},
"boundary_adjustments": {
"start_adjusted": true,
"start_adjustment_reason": "UTF-8 character boundary alignment",
"end_adjusted": true,
"end_adjustment_reason": "UTF-8 character boundary alignment"
},
"is_partial": true,
"percent_of_file": 0.1
}
}
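The metadata fields in the sample above could be derived roughly as follows. The struct and function names are hypothetical, and `percent_of_file` is assumed to be a percentage rounded to one decimal place, which is what the sample values suggest (1005 of 1,024,000 bytes is about 0.1%).

```rust
/// Hypothetical in-memory form of the "metadata" object in the response.
#[derive(Debug, PartialEq)]
pub struct RangeMetadata {
    pub file_size: u64,
    pub requested_range: (u64, u64),
    pub actual_range: (u64, u64),
    pub start_adjusted: bool,
    pub end_adjusted: bool,
    pub is_partial: bool,
    pub percent_of_file: f64,
}

/// Derive the metadata from the requested range and the range actually read.
pub fn build_metadata(file_size: u64, requested: (u64, u64), actual: (u64, u64)) -> RangeMetadata {
    let len = actual.1.saturating_sub(actual.0);
    // Assumption: an empty file counts as fully read.
    let percent = if file_size == 0 {
        100.0
    } else {
        // Percentage of the file covered, rounded to one decimal place.
        ((len as f64 / file_size as f64) * 1000.0).round() / 10.0
    };
    RangeMetadata {
        file_size,
        requested_range: requested,
        actual_range: actual,
        start_adjusted: requested.0 != actual.0,
        end_adjusted: requested.1 != actual.1,
        is_partial: len < file_size,
        percent_of_file: percent,
    }
}
```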
For error cases:
{
"error": "Binary files are not supported. File detected as image/png. Please use another tool or method to process this file."
}
Or for invalid ranges:
{
"error": "Invalid range specified: start_byte (5000) is greater than end_byte (4000)."
}
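A small sketch of the range check that would produce the invalid-range error shown above. The `validate_range` name is hypothetical, and treating a missing bound as always valid is an assumption.

```rust
/// Hypothetical helper: reject a range whose start lies after its end.
/// A range with either bound missing is accepted as-is.
pub fn validate_range(start_byte: Option<u64>, end_byte: Option<u64>) -> Result<(), String> {
    if let (Some(start), Some(end)) = (start_byte, end_byte) {
        if start > end {
            return Err(format!(
                "Invalid range specified: start_byte ({start}) is greater than end_byte ({end})."
            ));
        }
    }
    Ok(())
}
```

Running this check before opening the file means a malformed request fails without any I/O.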
To minimize memory usage and improve performance:
Use tokio::fs::File::open() to get a file handle
Use file.metadata() to get the file size without reading content
Use file.seek() to position near start_byte
Use file.take(adjusted_end_byte - adjusted_start_byte) to create a limited reader
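The open, metadata, seek, and take sequence can be sketched with the synchronous `std::fs` equivalents. This is a stand-in for the tokio calls the plan names, used here so the example has no external dependencies. The `read_byte_range` name and the clamping behaviour are assumptions.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical helper: read only the bytes in [start, end) and return them
/// together with the total file size. Out-of-bounds values are clamped.
pub fn read_byte_range(path: &Path, start: u64, end: u64) -> io::Result<(u64, Vec<u8>)> {
    let mut file = File::open(path)?;
    // The size comes from file metadata, without reading any content.
    let file_size = file.metadata()?.len();
    let start = start.min(file_size);
    let end = end.clamp(start, file_size);
    // Jump straight to the start of the range.
    file.seek(SeekFrom::Start(start))?;
    // `take` caps the reader so at most `end - start` bytes are read.
    let mut buf = Vec::with_capacity((end - start) as usize);
    file.take(end - start).read_to_end(&mut buf)?;
    Ok((file_size, buf))
}
```

Memory use is proportional to the requested range rather than the file size, so a very large range would still allocate a large buffer; the boundary adjustment and UTF-8 validation would run on the returned bytes.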