plans/file_load_enhancements.md
In a prior session, these items were analyzed and already committed to Dev_Master:
| Done | Item | File(s) |
|---|---|---|
| Yes | Bug fix: BeginWaitCursor timing — wait cursor now activates after file size is known | src/Edit.c |
| Yes | Enhancement: FILE_FLAG_SEQUENTIAL_SCAN added to CreateFileW | src/Edit.c:1197 |
| Yes | Bug fix: Style_SetLexerFromFile no longer overrides SC_CACHE_DOCUMENT for small files | src/Styles.c:1765-1768 |
| Yes | Enhancement: Idle styling tiered — SC_IDLESTYLING_AFTERVISIBLE for files >2MB | src/Styles.c:1785-1791 |
| Yes | Bug fix: UTF-16 WideCharToMultiByte now uses exact char counts (no off-by-one) | src/Edit.c:1399-1412 |
This plan covers the 3 remaining enhancements that require moderate-to-major refactoring and should be profiled before/after to validate the benefit.
    CreateFileW(FILE_FLAG_SEQUENTIAL_SCAN)
      → ReadFileXL() [chunked DWORD_MAX reads]
      → ReadAndDecryptFile() [optional AES-256 decrypt]
      → Encoding_DetectEncoding() [BOM check, uchardet, UTF-8 validation]
      → Encoding conversion:
          UTF-16 path: SwabEx (if BE) → WideCharToMultiByteEx → UTF-8
          UTF-8 path:  direct (skip BOM if present)
          ANSI/MBCS:   MultiByteToWideCharEx → WideCharToMultiByteEx → UTF-8
      → EditSetNewText()
          → _PrepareDocBuffer() [clear markers, set wrap=NONE]
          → EditSetDocumentBuffer()
          → CreateNewDocument(): SciCall_CreateDocument + SciCall_ReplaceTarget
      → Style_SetLexerFromFile() [lexer, indicators, idle styling, layout cache]
      → Post-load: caret restore, EOL/indent checks, file watching
| File | Function | Lines | Role |
|---|---|---|---|
| src/Edit.c | EditLoadFile() | 1177-1487 | Disk I/O, encoding detection, conversion |
| src/Edit.c | EditSetNewText() | 419-448 | Buffer → Scintilla handoff orchestration |
| src/Edit.c | _PrepareDocBuffer() | 407-417 | Clear markers, disable wrap before load |
| src/Config/Config.cpp | EditSetDocumentBuffer() | 2946-2977 | Final Scintilla document creation |
| src/Config/Config.cpp | CreateNewDocument() | 2898-2923 | SCI_CREATEDOCUMENT + SCI_REPLACETARGET |
| src/Helpers.c | ReadFileXL() | 835-850 | Chunked disk read (DWORD_MAX chunks) |
| src/crypto/crypto.c | ReadAndDecryptFile() | ~463 | Read + optional AES decrypt |
| src/EncodingDetection.cpp | Encoding_DetectEncoding() | ~1261 | BOM/uchardet/UTF-8 analysis |
| src/Notepad3.c | FileLoad() | ~10917-11230 | Orchestration, post-load processing |
| src/Styles.c | Style_SetLexerFromFile() | ~1700-1791 | Lexer, indicators, styling setup |
| src/SciCall.h | SciCall_CreateLoader() | 221 | Wrapper: DeclareSciCallR2(CreateLoader, CREATELOADER, sptr_t, DocPos, bytes, int, options) |
| scintilla/include/ILoader.h | ILoader class | 16-22 | AddData(), ConvertToDocument(), Release() |
| scintilla/doc/ScintillaDoc.html | - | 7587-7630 | ILoader documentation |
Enhancement A: SCI_CREATELOADER (ILoader) for UTF-8 Fast Path
Priority: Medium (reduces peak memory for the most common encoding path)
Effort: ~30 lines changed in Config.cpp
Risk: Moderate (new Scintilla API usage, needs profiling)
CreateNewDocument() in src/Config/Config.cpp:2898-2923 currently:
1. SciCall_CreateDocument(lenText, docOptions): Scintilla allocates a document buffer
2. SciCall_SetDocPointer(pNewDocumentPtr): installs the empty document
3. SciCall_TargetWholeDocument() + SciCall_ReplaceTarget(lenText, lpstrText): copies all data into it

Step 3 copies the entire UTF-8 buffer into Scintilla's internal buffer. During this copy, both the app's lpData/lpDataUTF8 buffer and Scintilla's document buffer exist simultaneously. For a 500MB file, that's ~1GB peak memory.
Scintilla's ILoader interface (scintilla/include/ILoader.h) allows feeding data directly into a document under construction via AddData(), avoiding the double-buffer:
    #include "ILoader.h"  // scintilla/include/ILoader.h

    // In CreateNewDocument(), replace the SCI_CREATEDOCUMENT + SCI_REPLACETARGET path:
    sptr_t const loaderPtr = SciCall_CreateLoader((DocPos)lenText, docOptions);
    if (loaderPtr) {
        Scintilla::ILoader* pLoader = reinterpret_cast<Scintilla::ILoader*>(loaderPtr);
        static constexpr size_t CHUNK_SIZE = 4 * 1024 * 1024;  // 4MB chunks
        int status = SC_STATUS_OK;
        for (size_t offset = 0; offset < lenText && status == SC_STATUS_OK; offset += CHUNK_SIZE) {
            size_t const n = min(CHUNK_SIZE, lenText - offset);
            status = pLoader->AddData(lpstrText + offset, (Sci_Position)n);
        }
        if (status == SC_STATUS_OK) {
            void* pDoc = pLoader->ConvertToDocument();  // ownership transferred
            SciCall_SetDocPointer((sptr_t)pDoc);
            SciCall_ReleaseDocument((sptr_t)pDoc);
        } else {
            pLoader->Release();
            // fall back to current approach
        }
    }
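The control flow around the loader (chunked feed, then fall back if the loader is missing or reports an error) can be exercised without Scintilla. The following is a minimal sketch under stated assumptions: `FakeLoader`, `LoadPath` and `FeedInChunks` are hypothetical names invented for illustration, and `FakeLoader` only mirrors the call shape of `ILoader::AddData()`; it is not the Scintilla interface. The one fact relied on is that `SC_STATUS_OK` is 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Scintilla::ILoader (assumption, not the real class).
struct FakeLoader {
    std::string doc;     // bytes accepted so far
    size_t failAfter;    // simulate an allocation failure beyond this many bytes

    int AddData(const char* data, size_t len) {
        if (doc.size() + len > failAfter) return 1;  // non-zero: failure status
        doc.append(data, len);
        return 0;                                    // 0: success (SC_STATUS_OK)
    }
};

enum class LoadPath { Loader, Fallback };

// Feeds `text` in fixed-size chunks and reports which path the caller must take.
// A null loader models SciCall_CreateLoader() returning 0.
inline LoadPath FeedInChunks(FakeLoader* loader, const char* text,
                             size_t len, size_t chunkSize) {
    assert(chunkSize > 0);
    if (!loader) return LoadPath::Fallback;
    for (size_t offset = 0; offset < len; offset += chunkSize) {
        size_t const n = std::min(chunkSize, len - offset);
        if (loader->AddData(text + offset, n) != 0) {
            // Caller would Release() the loader and use the monolithic path.
            return LoadPath::Fallback;
        }
    }
    return LoadPath::Loader;
}
```

Returning an explicit path value keeps the fallback decision in one place, which also covers the case where the loader could not be created at all.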
- Config.cpp is already C++, so reinterpret_cast and Scintilla::ILoader are available
- SciCall_CreateLoader wrapper already exists in SciCall.h:221
- ILoader::AddData() returns SC_STATUS_* codes; check for SC_STATUS_OK
- ConvertToDocument() returns a void* doc pointer with a reference count of 1; SetDocPointer adds Scintilla's own reference, after which the caller must call ReleaseDocument
- reload parameter (for SciCall_ReplaceTargetMinimal) is NOT compatible with ILoader, because ILoader always creates a fresh document. So ILoader should only be used when !reload (first load, not file revert). The reload path should keep using ReplaceTargetMinimal.
- #include "ILoader.h": verify include search paths in the Config.cpp compilation unit

Enhancement B: ANSI/MBCS conversion path
Priority: Low (only affects non-UTF-8 files; complexity outweighs benefit)
Effort: Complex
Risk: High (encoding edge cases)
The ANSI/MBCS conversion path in src/Edit.c:1446-1474 performs:
    Step 1: lpData = raw file data                    [fileSize bytes]
    Step 2: lpDataWide = AllocMem(cbData * 2 + 16)    [2x fileSize]
            MultiByteToWideCharEx → fills lpDataWide
    Step 3: FreeMem(lpData)
            lpData = AllocMem(cbDataWide * 3 + 16)    [up to 3x wchar count]
            WideCharToMultiByteEx → fills new lpData
    Step 4: EditSetNewText(lpData) → copies into Scintilla
    Step 5: FreeMem(lpDataWide), FreeMem(lpData)
Between steps 2-3, peak memory is fileSize + 2*fileSize + 3*wcharCount ≈ up to 6x fileSize for worst-case MBCS expansion. After step 3 frees the original lpData, peak drops, but the intermediate spike can be significant for large ANSI files.
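The 6x figure can be reproduced by adding up the three allocation sizes from the steps above. This is a minimal sketch, assuming a single-byte code page so that each input byte yields one UTF-16 code unit; `AnsiPathBytes` and `AnsiPathAllocations` are hypothetical names used only for the arithmetic and do not exist in the code base.

```cpp
#include <cassert>
#include <cstdint>

// Sizes of the three buffers in the ANSI/MBCS path (illustrative only).
struct AnsiPathBytes {
    uint64_t raw;   // Step 1: original file bytes
    uint64_t wide;  // Step 2: cbData * 2 + 16
    uint64_t utf8;  // Step 3: cbDataWide * 3 + 16

    uint64_t allThree() const { return raw + wide + utf8; }   // upper bound
    uint64_t afterRawFreed() const { return wide + utf8; }    // raw buffer released
};

// cbData: file size in bytes; wcharCount: UTF-16 code units produced by step 2.
inline AnsiPathBytes AnsiPathAllocations(uint64_t cbData, uint64_t wcharCount) {
    // Guard against overflow in the multiplications below.
    assert(cbData <= (UINT64_MAX - 16) / 2);
    assert(wcharCount <= (UINT64_MAX - 16) / 3);
    return AnsiPathBytes{ cbData, cbData * 2 + 16, wcharCount * 3 + 16 };
}
```

For a 100 MiB single-byte file this gives about 600 MiB when all three buffers are counted together and about 500 MiB once the raw buffer has been freed, which is the range profiling would need to confirm.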
1. Query exact sizes first: Call MultiByteToWideChar and WideCharToMultiByte with a NULL output buffer and an output size of 0 to get the required sizes, then allocate tightly. The current code uses worst-case multipliers (*2, *3).
2. Reuse buffer when safe: For ISO-8859-1 (Latin-1), the UTF-8 expansion is at most 2x. Windows code pages such as CP1252 and CP1250 map some bytes to code points above U+07FF (for example 0x80 is U+20AC, which needs 3 UTF-8 bytes), so their worst case is 3x. If the worst-case UTF-8 size for the encoding, plus the 16-byte padding, is <= SizeOfMem(original lpData), reuse the original buffer. This requires knowing the encoding class.
3. Combine with ILoader (Enhancement A): If Enhancement A is done, the ANSI path could feed UTF-8 chunks into ILoader instead of building a full UTF-8 buffer. This requires chunked conversion; see Enhancement C.
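The "measure first, then allocate exactly" idea from option 1 can be illustrated without the Win32 calls. The sketch below is a portable stand-in, not the MultiByteToWideChar/WideCharToMultiByte path: it assumes ISO-8859-1 input, where each byte value equals its Unicode code point, and the function names `Latin1Utf8Length` and `Latin1ToUtf8` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counting pass: bytes below 0x80 stay 1 byte in UTF-8, all others become 2.
inline size_t Latin1Utf8Length(const unsigned char* src, size_t len) {
    assert(src != nullptr || len == 0);
    size_t out = len;
    for (size_t i = 0; i < len; ++i) out += (src[i] >> 7);
    return out;
}

// Conversion pass into a buffer reserved to the exact size computed above.
inline std::string Latin1ToUtf8(const unsigned char* src, size_t len) {
    std::string dst;
    dst.reserve(Latin1Utf8Length(src, len));
    for (size_t i = 0; i < len; ++i) {
        unsigned char const c = src[i];
        if (c < 0x80) {
            dst.push_back(static_cast<char>(c));
        } else {
            dst.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            dst.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    return dst;
}
```

The counting pass costs one extra read of the input, which is the trade-off against the worst-case multipliers: less memory reserved, slightly more CPU time.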
The current code is correct, clear, and handles all edge cases. The triple-allocation is inherent to Win32's two-step conversion (no direct ANSI→UTF-8 API). Only implement if profiling shows this path is a real-world bottleneck for specific users with large non-UTF-8 files.
Enhancement C: Chunked streaming pipeline
Priority: High value, high effort (the "ultimate" file loading optimization)
Effort: Major architectural change (~200+ lines, multiple files)
Risk: High (multi-byte boundary handling, encryption interaction)
Full streaming pipeline replacing the current "read everything → convert everything → copy to Scintilla":
Disk → [4MB chunks] → Encoding Conversion → ILoader::AddData() → ConvertToDocument()
Benefits:
    ┌─────────────────────────────────────────────────────┐
    │  EditLoadFile()                                     │
    │                                                     │
    │  1. CreateFileW (FILE_FLAG_SEQUENTIAL_SCAN)         │
    │  2. Read first 4MB chunk                            │
    │  3. Detect encoding from first chunk:               │
    │     - BOM check, uchardet, UTF-8 validation         │
    │     - FileVars_GetFromData (first chunk only)       │
    │  4. Create ILoader: SciCall_CreateLoader(fileSize)  │
    │  5. Loop:                                           │
    │     a. ReadFileXL(chunk, 4MB)                       │
    │     b. Convert chunk: Source → [UTF-16] → UTF-8     │
    │     c. ILoader::AddData(utf8Chunk, utf8Len)         │
    │     d. Report progress (optional)                   │
    │     e. Check cancellation (optional)                │
    │  6. doc = ILoader::ConvertToDocument()              │
    │  7. SciCall_SetDocPointer(doc)                      │
    │                                                     │
    │  Fallback: if ILoader fails, use current monolithic │
    │  approach                                           │
    └─────────────────────────────────────────────────────┘
When reading in 4MB chunks, a chunk boundary can land in the middle of a multi-byte sequence:
    // Example: find safe UTF-8 split point
    size_t SafeUTF8Split(const char* data, size_t len) {
        if (len == 0) return 0;
        // Scan backward up to 3 bytes to find the start byte of the last sequence
        for (size_t i = 1; i <= min((size_t)3, len); i++) {
            unsigned char c = (unsigned char)data[len - i];
            if ((c & 0x80) == 0) return len;      // ASCII: complete
            if ((c & 0xC0) == 0xC0) {             // Start byte
                size_t expected = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : 2;
                if (i >= expected) return len;    // Complete sequence
                return len - i;                   // Incomplete: split before it
            }
        }
        // Three trailing continuation bytes: in valid UTF-8 this is the tail of a
        // 4-byte sequence whose start byte sits at len - 4, so it is complete.
        return len;
    }
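How the split point would be used in the read loop is worth spelling out: the bytes after the split are carried over and prepended to the next chunk. The following is a minimal sketch under stated assumptions. `SplitPoint` restates the SafeUTF8Split logic above so the block compiles on its own, and `SplitIntoSafePieces` is a hypothetical helper that simulates fixed-size reads from an in-memory string instead of a file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Same logic as SafeUTF8Split, restated so this sketch is self-contained.
inline size_t SplitPoint(const char* data, size_t len) {
    for (size_t i = 1; i <= std::min<size_t>(3, len); ++i) {
        unsigned char const c = static_cast<unsigned char>(data[len - i]);
        if ((c & 0x80) == 0) return len;               // ASCII
        if ((c & 0xC0) == 0xC0) {                      // start byte
            size_t const expected = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : 2;
            return (i >= expected) ? len : len - i;
        }
    }
    return len;
}

// Simulates reads of `chunkSize` bytes and returns the pieces that would be
// handed to the converter; an incomplete tail is carried into the next read.
inline std::vector<std::string> SplitIntoSafePieces(const std::string& input,
                                                    size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<std::string> pieces;
    std::string pending;                               // carried-over tail
    for (size_t off = 0; off < input.size(); off += chunkSize) {
        pending.append(input, off, chunkSize);         // length is clamped at end
        size_t const cut = SplitPoint(pending.data(), pending.size());
        if (cut > 0) pieces.emplace_back(pending, 0, cut);
        pending.erase(0, cut);
    }
    // A leftover here means the file ends in a truncated sequence; pass it on
    // unchanged and let the consumer decide how to treat it.
    if (!pending.empty()) pieces.push_back(pending);
    return pieces;
}
```

At most 3 bytes are ever carried over, so the extra copy is negligible next to a 4MB chunk.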
Encoding_DetectEncoding() needs the raw data to detect encoding. Solution: read the first chunk, run detection, then process all chunks (including the first) through the encoding conversion loop.
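The BOM part of first-chunk detection is simple enough to sketch. This is an illustration only, not the logic of Encoding_DetectEncoding(): `Bom`, `BomInfo` and `DetectBom` are hypothetical names, uchardet and UTF-8 validation are not modeled, and UTF-32 BOMs are deliberately not distinguished (FF FE 00 00 would be reported as UTF-16LE here).

```cpp
#include <cassert>
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

struct BomInfo {
    Bom kind;
    size_t length;  // bytes to skip in the first chunk before conversion
};

// Inspects only the first bytes of the first chunk.
inline BomInfo DetectBom(const unsigned char* p, size_t n) {
    assert(p != nullptr || n == 0);
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return { Bom::Utf8, 3 };
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return { Bom::Utf16LE, 2 };
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return { Bom::Utf16BE, 2 };
    return { Bom::None, 0 };
}
```

Returning the BOM length alongside the kind lets the conversion loop start the first chunk at the right offset while treating every later chunk identically.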
ReadAndDecryptFile() currently reads the entire file, then decrypts in-place. AES-256 CBC decryption is block-based (16 bytes) and could work on chunks, but the current implementation (src/crypto/crypto.c) is monolithic. Fallback: For encrypted files, use the current monolithic path.
FileVars_GetFromData

FileVars_GetFromData() scans for Emacs/Vim file variables in the first few lines. It only needs the first chunk. Call it on the first chunk's UTF-8 output before continuing the loop.
EditDetectEOLMode

Currently runs on the complete UTF-8 buffer. For chunked loading, accumulate EOL counts during the conversion loop (count \r\n, \r, \n as you go) and determine the mode at the end.
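One detail of counting across chunks is that a \r\n pair can straddle a chunk boundary. A minimal sketch of an accumulator that handles this follows; `EolCounter` is a hypothetical type for illustration, and how the final counts map to an EOL mode is left to the existing EditDetectEOLMode logic, which is not modeled here.

```cpp
#include <cassert>
#include <cstddef>

// Accumulates EOL counts over successive UTF-8 chunks.
struct EolCounter {
    size_t crlf = 0;
    size_t cr = 0;
    size_t lf = 0;
    bool pendingCR = false;  // previous byte was '\r' and is not classified yet

    void Feed(const char* data, size_t len) {
        assert(data != nullptr || len == 0);
        for (size_t i = 0; i < len; ++i) {
            char const c = data[i];
            if (pendingCR) {
                pendingCR = false;
                if (c == '\n') { ++crlf; continue; }  // completes a CR LF pair
                ++cr;                                 // the earlier '\r' stood alone
            }
            if (c == '\r') pendingCR = true;
            else if (c == '\n') ++lf;
        }
    }

    // Call once after the last chunk to classify a trailing '\r'.
    void Finish() {
        if (pendingCR) { ++cr; pendingCR = false; }
    }
};
```

Because the pending state lives in the counter rather than in the loop, the chunk size can change without affecting the result.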
| File | Changes |
|---|---|
| src/Edit.c | Major rewrite of EditLoadFile() to add chunked path alongside monolithic fallback |
| src/Config/Config.cpp | New function (or modify EditSetDocumentBuffer) to accept ILoader instead of buffer |
| src/Edit.h | New function declarations if needed |
| src/Helpers.c | Possibly add SafeEncodingSplit() helper |
| src/crypto/crypto.c | Skip chunked path for encrypted files (no change needed) |
Each phase is independently testable and can be shipped separately.
- Build\Build_x64.cmd Release: clean compile, no warnings
- [...] FileRevert) preserves diff-minimal behavior (ReplaceTargetMinimal)
- [...] FILE_FLAG_SEQUENTIAL_SCAN interaction)