Documentation/mm/vmalloced-kernel-stacks.rst
.. SPDX-License-Identifier: GPL-2.0
:Author: Shuah Khan
.. contents:: :local:
This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.
Kernel stack overflows are often hard to debug and make the kernel susceptible to exploits. Problems may surface long after the overflow itself, making them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack overflows to be caught immediately rather than causing difficult-to-diagnose corruption.
The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable support for virtually mapped stacks with guard pages. This feature causes reliable faults when the stack overflows. The usability of the stack trace after an overflow and the response to the overflow itself are architecture dependent.
.. note:: As of this writing, arm64, powerpc, riscv, s390, um, and x86 have support for VMAP_STACK.
Architectures that can support Virtually Mapped Kernel Stacks should enable the HAVE_ARCH_VMAP_STACK bool configuration option. The requirements are:
When enabled, the VMAP_STACK bool configuration option allocates virtually mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

.. note::

        Using this feature with KASAN requires architecture support
        for backing virtual mappings with real shadow memory, and
        KASAN_VMALLOC must be enabled.

.. note::

        When VMAP_STACK is enabled, it is not possible to run DMA on
        stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_
When a new kernel thread is created, its stack is allocated as virtually contiguous memory backed by pages from the page level allocator. These pages are mapped into contiguous kernel virtual space with PAGE_KERNEL protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the stack with PAGE_KERNEL protections.
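
The layout of a stack bracketed by guard pages can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: it uses mmap() and mprotect() in place of the vmalloc machinery, and ``struct guarded_stack``, ``guarded_stack_alloc()`` and ``guarded_stack_free()`` are hypothetical names introduced only for this illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical userspace stand-in for a guarded stack; not a kernel type. */
struct guarded_stack {
	char *mapping;   /* start of the whole mapping, guards included */
	size_t map_len;  /* total mapped length in bytes */
	char *base;      /* lowest usable address */
	size_t size;     /* usable length in bytes */
};

/*
 * Reserve (pages + 2) pages with no access rights, then make only the
 * middle pages readable and writable. The first and last page stay
 * PROT_NONE and act as leading and trailing guard pages.
 * Returns 0 on success, -1 on failure.
 */
static int guarded_stack_alloc(struct guarded_stack *gs, size_t pages)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *p;

	if (page <= 0 || pages == 0)
		return -1;

	len = (pages + 2) * (size_t)page;
	p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	if (mprotect(p + page, pages * (size_t)page, PROT_READ | PROT_WRITE)) {
		munmap(p, len);
		return -1;
	}

	gs->mapping = p;
	gs->map_len = len;
	gs->base = p + page;
	gs->size = pages * (size_t)page;
	return 0;
}

/* Release the whole mapping, guard pages included. */
static void guarded_stack_free(struct guarded_stack *gs)
{
	munmap(gs->mapping, gs->map_len);
}
```

A caller would use the range from ``gs.base`` up to ``gs.base + gs.size`` as the stack. Any access one byte below or above that range lands in an inaccessible page and faults immediately instead of silently corrupting neighbouring memory.
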
Thread stack allocation is initiated from clone(), fork(), vfork(), and kernel_thread() via kernel_clone(). These names are useful starting points when searching the code base to understand when and how a thread stack is allocated.
The bulk of the code is in
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.
The stack_vm_area pointer in task_struct keeps track of the virtually allocated stack, and a non-null stack_vm_area pointer serves as an indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

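
The non-null check described above can be sketched with a small mock. The structures below are heavily simplified stand-ins that keep only the one field relevant here, not the kernel definitions, and ``task_has_vmapped_stack()`` is a hypothetical helper name chosen for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration only; NOT the kernel definitions. */
struct vm_struct {
	void *addr;          /* start of the virtually mapped area */
	unsigned long size;  /* size of the area in bytes */
};

struct task_struct {
	void *stack;                      /* lowest address of the task stack */
	struct vm_struct *stack_vm_area;  /* non-NULL for a vmapped stack */
};

/* Hypothetical helper: true if the task's stack is virtually mapped. */
static bool task_has_vmapped_stack(const struct task_struct *t)
{
	return t->stack_vm_area != NULL;
}
```

In this mock, a task whose stack_vm_area is NULL is treated as having a stack that was not virtually mapped.
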
Leading and trailing guard pages help detect stack overflows. When the stack overflows into the guard pages, handlers have to be careful not to overflow the stack again. When handlers are called, it is likely that very little stack space is left.
On x86, this is done by handling the page fault indicating the kernel stack overflow on the double-fault stack.
How do we ensure that VMAP_STACK is actually allocating with a leading and trailing guard page? The following lkdtm tests can help detect any regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

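
The idea behind these tests, touching a byte just outside the stack and expecting a fault, can be tried in userspace. The following is a sketch under stated assumptions, not the lkdtm code: the "stack" is an mmap() region with PROT_NONE pages on both sides, the probe runs in a forked child so the fault does not kill the caller, and ``map_guarded()`` and ``read_faults()`` are hypothetical helper names.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `pages` usable pages between two PROT_NONE
 * guard pages. Returns the lowest usable address and stores the usable
 * length in *size, or returns NULL on failure. The full mapping starts
 * one page below the returned address and is two pages longer than *size.
 */
static char *map_guarded(size_t pages, size_t *size)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *p;

	if (page <= 0 || pages == 0)
		return NULL;

	len = (pages + 2) * (size_t)page;
	p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	if (mprotect(p + page, pages * (size_t)page, PROT_READ | PROT_WRITE)) {
		munmap(p, len);
		return NULL;
	}

	*size = pages * (size_t)page;
	return p + page;
}

/*
 * Read one byte at addr in a child process.
 * Returns 1 if the child was killed by SIGSEGV, 0 if the read succeeded,
 * and -1 on any other outcome.
 */
static int read_faults(const volatile char *addr)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		volatile char c = *addr;  /* faults if addr is in a guard page */
		(void)c;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -1;
}
```

With ``base`` returned by ``map_guarded()``, probing ``base - 1`` corresponds to the leading guard page check and probing ``base + size`` to the trailing one. Both are expected to fault, while reads inside the usable range should succeed.
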