docs/DeferredSignals.md
FEX-Emu has locations in its code which are effectively "uninterruptible": if the guest application receives a signal while FEX is inside such a section, FEX is likely to hang or crash in spurious and terrible ways.
When FEX is in the process of emitting code, it often needs to acquire mutexes to safeguard operations like memory allocations or reading guest state. This puts FEX in a vulnerable state: If a signal is received in the middle of this, FEX may need to initiate compilation of new code. In this case a mutex could already be held, so attempting to acquire it again would trigger a deadlock.
One solution to this problem is to mask all signals on entry to an uninterruptible section and unmask them on exit. This is the classical approach, and it is viable if performance isn't a significant concern. Its major problem is that it requires two system calls per "uninterruptible" code section, which adds overhead that may exceed the runtime of the section itself.
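The classical masking approach can be sketched as follows. This is a minimal illustration, not FEX code: `ScopedSignalMask`, `OnSignal`, `g_handled` and `RunMaskedSection` are hypothetical names, and `sigprocmask` is used on the assumption of a single-threaded demo. The constructor and the destructor each cost one system call, which is the overhead described above.

```cpp
#include <signal.h>

// Hypothetical flag set by the handler so the effect can be observed.
static volatile sig_atomic_t g_handled = 0;

static void OnSignal(int) { g_handled = 1; }

// Hypothetical RAII guard: blocks all signals on construction (syscall #1)
// and restores the previous mask on destruction (syscall #2).
class ScopedSignalMask {
public:
  ScopedSignalMask() {
    sigset_t all;
    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old_);
  }
  ~ScopedSignalMask() { sigprocmask(SIG_SETMASK, &old_, nullptr); }
  ScopedSignalMask(const ScopedSignalMask&) = delete;
  ScopedSignalMask& operator=(const ScopedSignalMask&) = delete;

private:
  sigset_t old_;
};

// Returns true if the signal stayed pending for the whole masked section.
bool RunMaskedSection() {
  struct sigaction sa {};
  sa.sa_handler = OnSignal;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGUSR1, &sa, nullptr);

  g_handled = 0;
  bool stayed_pending = false;
  {
    ScopedSignalMask guard;
    raise(SIGUSR1);                      // arrives while masked, stays pending
    stayed_pending = (g_handled == 0);   // handler has not run yet
  }                                      // unmask: the pending signal is delivered
  return stayed_pending;
}
```

The section body here is a single `raise`, so the two mask changes dominate its cost, which is the situation that motivates a cheaper mechanism.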
A new solution is to defer asynchronous signals caught inside an uninterruptible section and handle them at the end of that section.
At the basic level, we increment a reference counter going into the "uninterruptible" section, and then decrement the reference counter once we leave.
This way, when the signal handler receives a signal, it can check that thread's reference counter and, if it is non-zero, store the siginfo_t to an array/stack object and return to the interrupted code segment so the signal can be handled later.
By making this check as cheap as possible, overhead is minimized for the general case that no signal occurs during "uninterruptible" sections. FEX achieves this by maintaining two memory regions for tracking deferred signals per thread.
The first region is FEX's InternalThreadState object, which is always resident for each guest thread and whose pointer is usually held in a register inside the JIT. This object holds the reference counter for "uninterruptible" code segments. It is specifically a reference counter because these code segments may nest inside each other, and a signal can only interrupt when the counter is zero.
This reference counter is thread local and won't be read by any other threads, so the increment and decrement can be non-atomic. Each is usually three instructions on ARM64.
NonAtomicRefCounter<uint64_t> DeferredSignalRefCount;
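A minimal sketch of how such a counter and a nesting-aware guard might look. `PlainRefCounter`, `ThreadState` and `ScopedDeferredRegion` are hypothetical stand-ins written for illustration; FEX's actual `NonAtomicRefCounter` and thread state may be shaped differently.

```cpp
#include <cstdint>

// Hypothetical stand-in for a non-atomic reference counter.
template <typename T>
struct PlainRefCounter {
  T Value{};
  void Increment() { Value = Value + 1; }  // plain load, add, store
  void Decrement() { Value = Value - 1; }
  bool IsZero() const { return Value == 0; }
};

// Hypothetical per-thread state that owns the counter.
struct ThreadState {
  PlainRefCounter<uint64_t> DeferredSignalRefCount;
};

// Hypothetical RAII guard marking one "uninterruptible" section.
class ScopedDeferredRegion {
public:
  explicit ScopedDeferredRegion(ThreadState& thread) : thread_(thread) {
    thread_.DeferredSignalRefCount.Increment();
  }
  ~ScopedDeferredRegion() { thread_.DeferredSignalRefCount.Decrement(); }
  ScopedDeferredRegion(const ScopedDeferredRegion&) = delete;
  ScopedDeferredRegion& operator=(const ScopedDeferredRegion&) = delete;

private:
  ThreadState& thread_;
};

// Enters two nested sections and returns the depth seen at the innermost one.
uint64_t NestedDepthExample(ThreadState& thread) {
  uint64_t depth = 0;
  ScopedDeferredRegion outer(thread);
  {
    ScopedDeferredRegion inner(thread);
    depth = thread.DeferredSignalRefCount.Value;  // both guards are active
  }
  return depth;
}
```

Because sections nest, a plain boolean flag would be cleared by the inner exit while the outer section is still running; the counter only returns to zero at the outermost exit.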
The second region is a single page of memory that is allocated per thread. Its purpose is to trigger a SIGSEGV when FEX leaves an "uninterruptible" section if a signal has been deferred. FEX's signal handler checks whether the faulting address is in this special page and, if so, starts the deferred signal mechanisms.
NonAtomicRefCounter<uint64_t> *InterruptFaultPage;
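The life cycle of such a page can be sketched with `mmap` and `mprotect`. The `FaultPage` wrapper and its method names are hypothetical and only illustrate the idea; they are not taken from FEX.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper around a per-thread fault page.
struct FaultPage {
  void* Base = MAP_FAILED;
  size_t Size = 0;

  // Maps one anonymous, writable page.
  bool Allocate() {
    Size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    Base = mmap(nullptr, Size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return Base != MAP_FAILED;
  }

  // Used when a signal is deferred: the next store to the page will fault.
  bool Arm() { return mprotect(Base, Size, PROT_NONE) == 0; }

  // Used when nothing is deferred any more: stores succeed again.
  bool Disarm() { return mprotect(Base, Size, PROT_READ | PROT_WRITE) == 0; }

  // Lets a SIGSEGV handler recognise the intentional fault by its address.
  bool Contains(const void* addr) const {
    const auto a = reinterpret_cast<uintptr_t>(addr);
    const auto b = reinterpret_cast<uintptr_t>(Base);
    return a >= b && a < b + Size;
  }

  void Free() {
    if (Base != MAP_FAILED) {
      munmap(Base, Size);
      Base = MAP_FAILED;
    }
  }
};
```

While the page is writable, the store on section exit is an ordinary memory write, so the common case pays no system call.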
; Increment the reference counter.
ldr x0, [x28, #(offsetof(CPUState, DeferredSignalRefCount))]
add x0, x0, #1
str x0, [x28, #(offsetof(CPUState, DeferredSignalRefCount))]
; Do the uninterruptible code section here.
<...>
; Now decrement the reference counter.
ldr x0, [x28, #(offsetof(CPUState, DeferredSignalRefCount))]
sub x0, x0, #1
str x0, [x28, #(offsetof(CPUState, DeferredSignalRefCount))]
; Load the fault page pointer from the CPUState.
ldr x0, [x28, #(offsetof(CPUState, InterruptFaultPage))]
; Just store zero. (1 cycle plus no dependencies on a register. Super fast!)
; Will store fine with no deferred signal, or SIGSEGV if there was one!
strb wzr, [x0]
In the case that FEX has received a signal, FEX's signal handler will first check to see if that thread's reference counter is zero or not.
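If the reference counter is zero, this is the easy case: just handle the signal as normal.
If the reference counter is not zero, the signal handler knows that FEX is in an uninterruptible code section. We check the signal to see if it is a synchronous signal or not, because only asynchronous signals are deferred.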
The deferring process starts with storing the kernel siginfo_t to a thread local array so we can restore it later.
We then modify the permissions on the thread local InterruptFaultPage to be PROT_NONE.
We then immediately return from the signal handler so that FEX can resume its "uninterruptible" code section without breaking anything.
Once the "uninterruptible" code section finishes, FEX will intentionally trigger a SIGSEGV by storing to the page.
Once FEX-Emu is in its SIGSEGV handler, it will determine that it is handling a deferred signal. It then pulls the previously saved siginfo_t and starts processing the signal.
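The flow from deferral to the intentional SIGSEGV can be sketched end to end on Linux. This is a simplified, single-threaded illustration with hypothetical names throughout (`g_ref_count`, `OnAsyncSignal`, `OnSegv`, `RunDeferredDemo`). It processes deferred signals directly inside the SIGSEGV handler, whereas FEX dispatches to the guest's signal handler, and it omits the synchronous-signal check.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Globals stand in for per-thread state in this single-threaded sketch.
static volatile uint64_t g_ref_count = 0;        // "uninterruptible" depth
static volatile uint8_t* g_fault_page = nullptr; // the special page
static size_t g_page_size = 0;
static siginfo_t g_deferred[8];                  // saved siginfo_t entries
static volatile size_t g_deferred_count = 0;
static volatile sig_atomic_t g_handled = 0;      // signals actually processed

static void HandleGuestSignal(const siginfo_t*) { g_handled = g_handled + 1; }

static void OnAsyncSignal(int, siginfo_t* info, void*) {
  if (g_ref_count == 0) {  // interruptible: handle right away
    HandleGuestSignal(info);
    return;
  }
  if (g_deferred_count < 8) {  // defer: save the siginfo_t for later
    g_deferred[g_deferred_count] = *info;
    g_deferred_count = g_deferred_count + 1;
  }
  // Arm the page so the store at section exit faults.
  mprotect(const_cast<uint8_t*>(g_fault_page), g_page_size, PROT_NONE);
}

static void OnSegv(int, siginfo_t* info, void*) {
  const auto addr = reinterpret_cast<uintptr_t>(info->si_addr);
  const auto base = reinterpret_cast<uintptr_t>(g_fault_page);
  if (addr < base || addr >= base + g_page_size) _exit(1);  // a genuine crash
  // Intentional fault: make the page writable so the store can be retried,
  // then process everything that was deferred.
  mprotect(const_cast<uint8_t*>(g_fault_page), g_page_size,
           PROT_READ | PROT_WRITE);
  while (g_deferred_count > 0) {
    g_deferred_count = g_deferred_count - 1;
    HandleGuestSignal(&g_deferred[g_deferred_count]);
  }
}

// Returns how many signals were processed inside the section, or -1 on error.
int RunDeferredDemo() {
  g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* page = mmap(nullptr, g_page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return -1;
  g_fault_page = static_cast<uint8_t*>(page);

  struct sigaction sa {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;
  sa.sa_sigaction = OnAsyncSignal;
  sigaction(SIGUSR1, &sa, nullptr);
  sa.sa_sigaction = OnSegv;
  sigaction(SIGSEGV, &sa, nullptr);

  g_ref_count = g_ref_count + 1;  // enter the section
  raise(SIGUSR1);                 // arrives mid-section and is deferred
  const int handled_inside = g_handled;
  g_ref_count = g_ref_count - 1;  // leave the section
  *g_fault_page = 0;              // faults only if a signal was deferred
  return handled_inside;
}
```

When the SIGSEGV handler returns, the kernel restarts the faulting store, which now succeeds because the page is writable again.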
Once a guest signal handler has finished what it was working on, it will call rt_sigreturn or sigreturn, which triggers FEX's SIGILL signal handler.
Inside this SIGILL signal handler, FEX will restore its state /back/ to where the deferred signal handling started (the store to the fault page). Then FEX will check if any further deferred signals need to be handled.
If there are, the fault page is left as PROT_NONE; otherwise its permissions are set back to RW.
Once FEX gets back to the page store, it will trampoline back to the SIGSEGV handler if it has more signals to handle.
This mechanism depends on sigreturn to handle stacked deferred signals, so a longjmp would interfere with this.
There are two edges to this problem that must be considered: the incrementing edge and the decrementing edge.
The incrementing edge is the most problematic. Incrementing the ref counter takes three instructions (one on x86). If a signal is received between the load and the store, this could theoretically result in a tear on the refcounter. In practice the tear is real, but it doesn't cause any problems.
The reasoning for this is that FEX isn't in the "uninterruptible" section until that reference counter has been stored, so FEX will handle the signals immediately at that point, return to this code location, and then increment the counter. In particular, once returning to the code location the refcounter will be the original value loaded. So even though it is a tear, it's one that doesn't cause issues since it is all thread local.
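The argument can be checked with a small model that splits the increment into its three steps and injects a signal between the load and the store. `TearModel` and its members are hypothetical names; this is a single-threaded simulation of the reasoning above, not real signal handling.

```cpp
#include <cstdint>

// Hypothetical model of the thread-local counter and the signal handler.
struct TearModel {
  uint64_t RefCount = 0;
  int HandledImmediately = 0;

  // Signal handler model: with the counter at zero the signal is handled at
  // once. Any uninterruptible section entered while handling it is balanced
  // before the handler returns.
  void SignalArrives() {
    if (RefCount == 0) {
      RefCount += 1;  // nested section inside the handler
      RefCount -= 1;  // its matching exit
      HandledImmediately += 1;
    }
  }

  // The increment with a signal landing between the load and the store.
  void IncrementWithSignalInMiddle() {
    uint64_t loaded = RefCount;  // ldr
    SignalArrives();             // signal lands here
    loaded += 1;                 // add
    RefCount = loaded;           // str: writes the stale value plus one
  }
};
```

The store overwrites the counter with a value computed from a stale load, but since the handler left the counter exactly as it found it, the stale value is still the correct one.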
The decrementing edge is far less problematic than the incrementing edge. Signals will get deferred entirely until the store instruction (if storing zero), so FEX will always return to the code region and finish the decrement.
If FEX receives a signal after the decrement store has completed but /before/ the page faulting store has occurred, then FEX will start processing the signal immediately. At that point the fault page will have either RW or NONE permission. FEX will then likely hit another "uninterruptible" code section, which will complete the store to the fault page.
With RW permission there are no problems; execution continues as normal.
With NONE permission the store is captured by the fault handler, which determines that there were no deferred signals, sets the fault page back to RW permissions, and continues execution safely.
This is a simple example because nothing happens.
This is simple because the JIT just handles it.
Deferred signals don't affect anything here because only asynchronous signals get affected.
This is the first interesting example since deferred signals affect it.
This one mostly matches the previous example, except for the behaviour when leaving deferred signal regions.
In this case, if the thread-local refcount is still >0 on <Exit deferred region> then there are two behaviours.
<Exit deferred region> so the cost of SIGSEGV+PC increment is faster.
This has the expectation that recursive deferred regions both aren't very deep (usually only nested a couple of times) and that signals are rare. This way there aren't many SIGSEGV checks generated, and the signal is finally only handled when reaching the top-most deferred region exit routine.
This is slightly different from the previous iterations since multiple signals in the stack result in odd behaviour.