agents/windows-memory-profiling-port.md
Porting Scalene's memory profiling from Linux/macOS to Windows. On Unix systems, Scalene uses LD_PRELOAD/DYLD_INSERT_LIBRARIES to interpose on malloc/free. Windows doesn't support this mechanism, so we use Python's allocator API instead.
Unix:
- libscalene.so/libscalene.dylib is preloaded via environment variables
- Memory-mapped files under /tmp (/tmp/scalene-*) for data transfer

Windows:
- libscalene.dll is loaded explicitly via ctypes
- Uses the PyMem_SetAllocator API to intercept allocations
- Named shared memory (Local\scalene-*) for data transfer

Files added or modified for the port:
- src/source/libscalene_windows.cpp - Main Windows DLL implementation (includes Detours native hooks)
- src/include/samplefile_win.hpp - Windows shared memory implementation
- src/include/common_win.hpp - Windows compatibility macros
- src/include/mallocrecursionguard_win.hpp - Windows TLS-based recursion guard
- scalene/scalene_windows.py - Python helper for Windows memory profiling
- vendor/Detours/ - Microsoft Detours library for native malloc/free hooking (vendored)
- CMakeLists.txt - Cross-platform build configuration with Detours integration
- src/include/pywhere.hpp - Windows DLL export/import macros
- src/source/pywhere.cpp - Windows symbol lookup via accessor functions
- scalene/scalene_profiler.py - Windows DLL loading and initialization
- scalene/scalene_mapfile.py - Windows named shared memory support
- scalene/scalene_signal_manager.py - Windows memory polling thread
- scalene/scalene_preload.py - Case-insensitive ARM64 detection
- scalene/scalene_arguments.py - Enable memory profiling on Windows

Key implementation points:
- Python allocations are intercepted via PyMem_SetAllocator
- thread_local variables removed (caused crashes with dynamically loaded DLLs on Windows)
- Memory polling thread in scalene_signal_manager.py

alloc_samples: 203
max_footprint_mb: 152.71
max_footprint_python_fraction: 0.000288
Native fraction: 99.97%
The profiler now correctly tracks memory allocations from native libraries like numpy, with proper line attribution.
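As a quick sanity check on how the summary numbers relate to each other, the sketch below derives the native fraction and the approximate Python-attributed footprint from the reported values. The variable names are illustrative and are not taken from Scalene's code.

```python
# Values copied from the sample output above.
max_footprint_mb = 152.71
python_fraction = 0.000288

# Whatever is not attributed to Python is attributed to native code.
native_fraction = 1.0 - python_fraction
python_mb = max_footprint_mb * python_fraction

print(f"Native fraction: {native_fraction:.2%}")
print(f"Python footprint: {python_mb:.3f} MB")
```

The Python share works out to a few hundredths of a megabyte, which is consistent with a numpy-heavy workload where almost all memory is allocated natively.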
- UnicodeEncodeError on Windows consoles using CP1252 encoding
- Workaround: set the PYTHONIOENCODING=utf-8 environment variable or use --html or --json output

The Access Violation crash (exit code 3221225477 = 0xC0000005) was caused by missing ctypes return-type declarations: 64-bit pointers were being truncated to 32-bit values.
Problem: By default, ctypes assumes Windows API functions return c_int (32-bit). On 64-bit Windows, MapViewOfFile returns a 64-bit pointer. Without explicit return type declaration, the high 32 bits were truncated, causing invalid memory addresses.
Example of bug:
MapViewOfFile returns: 0x00007FFBF2650B90 (valid 64-bit pointer)
ctypes stored as: 0x00000000F2650B90 (truncated to 32-bit, invalid!)
Added proper return type declarations before calling Windows API functions:
import ctypes
from ctypes import wintypes
kernel32 = ctypes.windll.kernel32
# IMPORTANT: Set proper return types for Windows API functions
# Default ctypes return type is c_int (32-bit) which truncates 64-bit pointers
kernel32.OpenFileMappingW.restype = wintypes.HANDLE
kernel32.MapViewOfFile.restype = ctypes.c_void_p
kernel32.UnmapViewOfFile.argtypes = [ctypes.c_void_p]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
This fix is in scalene/scalene_mapfile.py in the _init_windows() method.
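The truncation itself can be reproduced on any platform without calling a Windows API, because ctypes integer types do no overflow checking. The following is a minimal sketch using the example address from the text; it assumes a 64-bit Python interpreter for the c_void_p line, and it only illustrates the effect rather than reproducing the code in scalene_mapfile.py.

```python
import ctypes

full_pointer = 0x00007FFBF2650B90  # example address from the text above

# Storing a 64-bit value in c_int keeps only the low 32 bits
# (interpreted as a signed number), with no error raised.
as_c_int = ctypes.c_int(full_pointer).value
low_32_bits = as_c_int & 0xFFFFFFFF

# c_void_p is pointer-sized, so the value survives on a 64-bit interpreter.
as_void_p = ctypes.c_void_p(full_pointer).value

print(f"original : {full_pointer:#018x}")
print(f"c_int    : {low_32_bits:#018x}")
print(f"c_void_p : {as_void_p:#018x}")
```

The low 32 bits match the truncated value shown in the bug example, which is why the resulting address points nowhere valid.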
The original Windows implementation only tracked Python allocations via PyMem_SetAllocator. Native library allocations (numpy arrays, pandas dataframes, etc.) went untracked because they bypass Python's allocator and call the C runtime malloc/free directly.
Before: Only Python allocations tracked (0.03% of actual memory usage)

After: All allocations tracked including native libraries (99.97% native, 0.03% Python)
We use Microsoft Detours to intercept malloc/free/realloc/calloc at the C runtime level. Detours works by rewriting the first few bytes of target functions with a jump to our hooks.
Why Detours over alternatives:
- vendor/Detours/ - Microsoft Detours source (vendored)
- CMakeLists.txt - Added Detours to Windows build with architecture-specific disassembler
- src/source/libscalene_windows.cpp - Added native hooks using Detours

# Microsoft Detours sources for native malloc/free hooking
set(DETOURS_SOURCES
    vendor/Detours/src/detours.cpp
    vendor/Detours/src/modules.cpp
    vendor/Detours/src/disasm.cpp
    vendor/Detours/src/image.cpp
    vendor/Detours/src/creatwth.cpp
)
# Add architecture-specific disassembler
if(SCALENE_ARCH STREQUAL "ARM64")
    list(APPEND DETOURS_SOURCES vendor/Detours/src/disolarm64.cpp)
elseif(SCALENE_ARCH STREQUAL "X64")
    list(APPEND DETOURS_SOURCES vendor/Detours/src/disolx64.cpp)
elseif(SCALENE_ARCH STREQUAL "X86")
    list(APPEND DETOURS_SOURCES vendor/Detours/src/disolx86.cpp)
endif()
target_compile_definitions(scalene PRIVATE DETOURS_INTERNAL)
#include "detours.h"
// Original function pointers (Detours fills these with trampolines)
static void* (__cdecl *Real_malloc)(size_t) = malloc;
static void (__cdecl *Real_free)(void*) = free;
static void* (__cdecl *Real_realloc)(void*, size_t) = realloc;
static void* (__cdecl *Real_calloc)(size_t, size_t) = calloc;
// Recursion guard - CRITICAL for preventing infinite loops
static bool g_in_native_hook = false;
// Coordination with Python allocator hooks
static bool g_in_python_allocator = false;
static void* __cdecl Hooked_malloc(size_t size) {
    // Check recursion guard FIRST - track_native_alloc may call malloc internally
    if (g_in_native_hook || g_in_python_allocator) {
        return Real_malloc(size);
    }
    g_in_native_hook = true;
    void* ptr = Real_malloc(size);
    if (ptr) {
        track_native_alloc(ptr, size);
        if (!p_scalene_done) {
            TheHeapWrapper::register_malloc(size, ptr, false);  // false = native
        }
    }
    g_in_native_hook = false;
    return ptr;
}
bool install_native_hooks() {
    DetourRestoreAfterWith();
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourAttach(&(PVOID&)Real_malloc, Hooked_malloc);
    DetourAttach(&(PVOID&)Real_free, Hooked_free);
    DetourAttach(&(PVOID&)Real_realloc, Hooked_realloc);
    DetourAttach(&(PVOID&)Real_calloc, Hooked_calloc);
    return DetourTransactionCommit() == NO_ERROR;
}
Problem: The native hooks use std::unordered_map for size tracking, which internally calls malloc. When the recursion guard was checked only AFTER calling the tracking functions, infinite recursion occurred:

1. Code calls malloc
2. Hooked_malloc is called
3. track_native_alloc(ptr, size) is called (hash map insert)
4. The hash map calls malloc for bucket allocation
5. Hooked_malloc is called again (recursion!)
6. Because g_in_native_hook wasn't set yet, we recurse infinitely → stack overflow

Fix: Check and set g_in_native_hook at the VERY BEGINNING of each hook, BEFORE any operations that might allocate:
static void* __cdecl Hooked_malloc(size_t size) {
    // MUST check recursion guard FIRST
    if (g_in_native_hook || g_in_python_allocator) {
        return Real_malloc(size);  // Bypass hook
    }
    g_in_native_hook = true;  // Set BEFORE any allocating operations
    // ... rest of hook logic ...
    g_in_native_hook = false;
    return ptr;
}
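The guard ordering can be modelled in a few lines of Python. This is only a simulation of the control flow: real_malloc, track and hooked_malloc are hypothetical stand-ins for Real_malloc, track_native_alloc and Hooked_malloc, and the nested call inside track plays the role of the hash map allocating a bucket.

```python
in_hook = False    # plays the role of g_in_native_hook
tracked = []       # sizes recorded by the tracker
hook_entries = 0   # how many times the hook was entered in total


def real_malloc(size):
    """Stand-in for the original allocator; returns a fake pointer."""
    return object()


def track(ptr, size):
    """Stand-in for track_native_alloc: recording may itself allocate."""
    hooked_malloc(64)  # simulates the hash map allocating a bucket
    tracked.append(size)


def hooked_malloc(size):
    global in_hook, hook_entries
    hook_entries += 1
    if in_hook:        # re-entrant call: bypass tracking entirely
        return real_malloc(size)
    in_hook = True     # set BEFORE anything that might allocate
    try:
        ptr = real_malloc(size)
        track(ptr, size)
    finally:
        in_hook = False
    return ptr


hooked_malloc(1024)
print(tracked, hook_entries)
```

The hook is entered twice (the outer call and the nested bucket allocation), but only the outer allocation is recorded. Moving the `in_hook = True` line below the call to `track` would make the nested call take the tracking path again and recurse until the stack is exhausted.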
Problem: _msize() doesn't work reliably for custom allocators (numpy uses aligned allocations with _aligned_malloc).
Solution: Manual size tracking with a hash map:
static std::unordered_map<void*, size_t> g_native_alloc_sizes;
static CRITICAL_SECTION g_native_alloc_sizes_lock;
static void track_native_alloc(void* ptr, size_t size) {
    EnterCriticalSection(&g_native_alloc_sizes_lock);
    g_native_alloc_sizes[ptr] = size;
    LeaveCriticalSection(&g_native_alloc_sizes_lock);
}
static size_t untrack_native_alloc(void* ptr) {
    EnterCriticalSection(&g_native_alloc_sizes_lock);
    auto it = g_native_alloc_sizes.find(ptr);
    size_t size = (it != g_native_alloc_sizes.end()) ? it->second : 0;
    if (it != g_native_alloc_sizes.end()) g_native_alloc_sizes.erase(it);
    LeaveCriticalSection(&g_native_alloc_sizes_lock);
    return size;
}
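To prevent double-counting allocations that go through both Python's allocator AND the native malloc:

- The Python allocator hooks set g_in_python_allocator = true when active
- The native hooks check this flag and skip tracking while it is set

// In Python allocator hook: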
static void* scalene_malloc(void* ctx, size_t len) {
    g_in_python_allocator = true;  // Prevent native hooks
    void* ptr = g_original_mem_allocator.malloc(...);
    // ... tracking code ...
    g_in_python_allocator = false;
    return ptr;
}
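The effect of the coordination flag on the accounting can be illustrated with a small simulation. The functions python_malloc and native_malloc below are hypothetical stand-ins for the Python allocator hook and the Detours hook. The sketch only shows that a request arriving through the Python allocator is counted once, as Python memory.

```python
in_python_allocator = False   # plays the role of g_in_python_allocator
python_bytes = 0
native_bytes = 0


def native_malloc(size):
    """Stand-in for the native hook: counts only when Python is not active."""
    global native_bytes
    if not in_python_allocator:
        native_bytes += size
    return bytearray(size)


def python_malloc(size):
    """Stand-in for the Python allocator hook, which reaches malloc underneath."""
    global in_python_allocator, python_bytes
    in_python_allocator = True
    try:
        buf = native_malloc(size)
        python_bytes += size
    finally:
        in_python_allocator = False
    return buf


python_malloc(100)   # counted as Python only
native_malloc(400)   # counted as native
print(python_bytes, native_bytes)
```

Without the flag check in native_malloc, the first request would be counted on both sides and the native total would be overstated by 100 bytes.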
- Added restype declarations for all Windows API functions that return handles or pointers
- MapViewOfFile.restype = ctypes.c_void_p - returns memory address
- OpenFileMappingW.restype = wintypes.HANDLE - returns handle
- Added argtypes for functions taking pointers

Changed malloc/free output from tab-separated 4-field format to comma-separated 9-field format matching Unix:
M,alloc_time,count,python_fraction,pid,pointer,filename,lineno,bytei\n\n
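To make the record layout concrete, here is a minimal parsing sketch. The field names follow the format line above, but the field types, the AllocRecord class, the parse_record helper and the sample line are all assumptions made for illustration. Scalene's actual parser may differ, and this sketch does not handle filenames that contain commas.

```python
from typing import NamedTuple


class AllocRecord(NamedTuple):
    action: str             # "M" in the format line above
    alloc_time: int
    count: int
    python_fraction: float
    pid: int
    pointer: str            # kept as text; the encoding is not specified here
    filename: str
    lineno: int
    bytei: int


def parse_record(line: str) -> AllocRecord:
    """Split one comma-separated record into its nine fields."""
    fields = line.strip().split(",")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    action, alloc_time, count, frac, pid, pointer, filename, lineno, bytei = fields
    return AllocRecord(action, int(alloc_time), int(count), float(frac),
                       int(pid), pointer, filename, int(lineno), int(bytei))


# Hypothetical sample line, not real profiler output.
sample = "M,1234,4096,0.000288,4321,0x7ffbf2650b90,testme.py,12,34\n\n"
record = parse_record(sample)
print(record.filename, record.lineno, record.count)
```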
Windows DLLs loaded with LoadLibrary/ctypes don't properly support thread_local variables. All of them were changed to static variables:

- mallocSampler and memcpySampler
- g_pythonCount, g_cCount
- g_lastMallocTrigger, g_freedLastMallocTrigger
- g_memcpyOps
- inMalloc (for recursion guard)

Since Unix signals (SIGXCPU/SIGXFSZ) don't exist on Windows, a polling thread was added in scalene_signal_manager.py:

- _windows_memory_poll_loop() - polls every 10ms
- _alloc_sigqueue_processor and _memcpy_sigqueue_processor

Added null checks for original allocator function pointers to prevent crashes if allocation hooks are called before initialization.
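The polling approach that replaces signal delivery can be sketched as a small background thread. MemoryPoller and on_poll are hypothetical names and the callback is a placeholder. The real loop lives in scalene_signal_manager.py and is not reproduced here. The 10 ms interval is the one stated above.

```python
import threading


class MemoryPoller:
    """Hypothetical stand-in for a polling thread that replaces signals."""

    def __init__(self, callback, interval_s=0.010):
        self._callback = callback
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop() is called, which ends the loop promptly.
        while not self._stop.wait(self._interval_s):
            self._callback()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


calls = []
first_poll = threading.Event()


def on_poll():
    calls.append(1)      # a real callback would process pending samples
    first_poll.set()


poller = MemoryPoller(on_poll)
poller.start()
first_poll.wait(timeout=2.0)   # wait until at least one poll has happened
poller.stop()
print(len(calls) >= 1)
```

Using an event for shutdown instead of a bare sleep keeps the thread responsive at exit, which matters when the profiled program finishes between two polls.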
Running the profiler now shows memory profiling data being collected:
malloc_calls=113807912, samples=595, logged=595
The profiler correctly tracks:
# Configure with CMake (for ARM64 native)
cmake -B build -A ARM64 -DPython3_ROOT_DIR="C:\Users\emery\AppData\Local\Programs\Python\Python311-arm64"
# Build the DLL
MSBuild.exe build/scalene.sln -p:Configuration=Release -p:Platform=ARM64 -t:scalene
# DLL is automatically placed in scalene\ directory
# Run profiler
python -m scalene --cli --cpu --memory test\testme.py
| Aspect | Unix | Windows |
|---|---|---|
| Native malloc hooking | LD_PRELOAD | Microsoft Detours (inline function hooking) |
| Python allocator | PyMem_SetAllocator | PyMem_SetAllocator (same) |
| Signals | SIGXCPU/SIGXFSZ | Polling thread |
| Shared Memory | /tmp files + mmap | Named shared memory |
| Symbol Lookup | dlsym(RTLD_DEFAULT) | GetProcAddress + accessor functions |
| Size Tracking | malloc_usable_size | Manual hash map (_msize unreliable) |
| Thread-local | thread_local | Static variables (GIL protects) |
| Pointer handling | Native 64-bit | Requires explicit ctypes declarations |
| Architecture support | All | ARM64, x64, x86 (via Detours) |
- Always set restype for any Windows API function that returns a handle or pointer
- The default c_int (32-bit) silently truncates 64-bit values
- Use ctypes.c_void_p for memory addresses, wintypes.HANDLE for handles

The crash was isolated using systematic component elimination:

- thread_local variables don't work reliably in DLLs loaded via LoadLibrary/ctypes
- DLL initialization (DllMain) has severe restrictions - avoid complex initialization there
- ARM64 builds need -A ARM64 and a matching Python version
- Named shared memory uses the Local\ prefix for per-session namespace
- The recursion guard (g_in_native_hook) MUST be checked and set BEFORE any other operations
- Pattern: if (guard) return original(); guard = true; ... ; guard = false; return result;
- Detours needs the architecture-specific disassembler: disolarm64.cpp for ARM64, disolx64.cpp for x64
- _msize() only works for standard CRT heap allocations
- numpy uses aligned allocations (_aligned_malloc) which _msize doesn't handle
- Use a coordination flag (g_in_python_allocator) to prevent native hooks from double-counting Python allocations

x64 Build and Testing: Current implementation tested only on ARM64. Need to verify x64 builds work correctly with Detours.
Unicode Console Output: The UnicodeEncodeError for sparkline characters could be handled more gracefully:
- Set PYTHONIOENCODING=utf-8 when running on Windows

Performance Optimization: The polling thread polls every 10ms. Consider:
Multi-Process Support: Test and fix any issues with profiling child processes on Windows (the Unix redirect_python mechanism may need Windows adaptation).
GPU Profiling on Windows: Verify NVIDIA GPU profiling works on Windows (pynvml should work but needs testing).
CI/CD Integration: Add Windows builds to GitHub Actions workflow:
Windows Event-Based Signaling: Replace polling with proper Windows Events for lower latency and CPU usage:
- Named events (ScaleneMallocEvent, etc.)
- WaitForMultipleObjects instead of polling

Memory Leak Detection: The --memory-leak-detector feature may need Windows-specific testing.
Web UI Testing: Verify the web-based GUI works correctly on Windows (browser launching, port binding).
# Add to scalene_mapfile.py to debug shared memory issues
print(f"Handle value: {handle:#x}", file=sys.stderr)
print(f"View address: {view:#x}", file=sys.stderr)
# Check if pointer looks valid (should have high bits set on 64-bit)
if view < 0x100000000:
    print("WARNING: Pointer may be truncated!", file=sys.stderr)
// In libscalene_windows.cpp, temporarily add:
fprintf(stderr, "DEBUG: ptr=%p, size=%zu\n", ptr, size);
Then rebuild with:
MSBuild.exe build/scalene.sln -p:Configuration=Release -p:Platform=ARM64 -t:scalene
- 0xC0000005 (3221225477) - Access Violation (invalid memory access)
- 0xC0000008 - Invalid Handle
- 0xC000001D - Illegal Instruction (wrong architecture)
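When a crash is reported as a decimal exit code, a small helper makes it easier to match against the codes listed above. This is a sketch: describe_exit_code and the lookup table are hypothetical, and the table covers only the three codes named in this document. The mask handles tools that report the same status as a negative signed 32-bit number.

```python
NTSTATUS_NAMES = {
    0xC0000005: "Access Violation",
    0xC0000008: "Invalid Handle",
    0xC000001D: "Illegal Instruction",
}


def describe_exit_code(code: int) -> str:
    """Render an exit code as hex plus a name, if it is one of the known codes."""
    unsigned = code & 0xFFFFFFFF   # normalise signed 32-bit representations
    name = NTSTATUS_NAMES.get(unsigned, "Unknown")
    return f"{unsigned:#010x} ({name})"


print(describe_exit_code(3221225477))
print(describe_exit_code(-1073741819))
```

Both calls describe the same status, since -1073741819 is the signed 32-bit view of 3221225477.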