.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3

Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes. Userland access outside of VMAs is invalid except in the case where
an adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that
is, threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.
Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

The kernel is designed to be highly scalable against concurrent read operations
on VMA metadata, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the
          memory they describe nor the page tables that map them.
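
As an illustration, here is a minimal sketch (not taken verbatim from the
kernel sources) of walking every VMA in an mm via this maple tree, using the
real :c:func:`!VMA_ITERATOR` and :c:func:`!for_each_vma` helpers; the
surrounding function is hypothetical and the mmap read lock taken here is
described in the sections that follow:

.. code-block:: c

   #include <linux/mm.h>

   /* Hypothetical helper: print the span of every VMA in an mm. */
   static void dump_vmas(struct mm_struct *mm)
   {
           VMA_ITERATOR(vmi, mm, 0);       /* begin iteration at address 0 */
           struct vm_area_struct *vma;

           mmap_read_lock(mm);             /* stabilise all VMAs in the mm */
           for_each_vma(vmi, vma)
                   pr_info("%lx-%lx\n", vma->vm_start, vma->vm_end);
           mmap_read_unlock(mm);
   }
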
There are three classes of VMA-related lock:

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity and behaves as a
  read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and
  a write lock via :c:func:`!vma_start_write` or
  :c:func:`!vma_start_write_killable` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.
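
For example, a hedged sketch of the rmap lock in use for file-backed memory
(:c:func:`!i_mmap_lock_read` and :c:func:`!vma_interval_tree_foreach` are real
helpers; the surrounding function is hypothetical):

.. code-block:: c

   /* Visit every VMA mapping the file pages [first, last]. */
   static void for_each_mapping_vma(struct address_space *mapping,
                                    pgoff_t first, pgoff_t last)
   {
           struct vm_area_struct *vma;

           i_mmap_lock_read(mapping);  /* rmap read lock - VMAs stabilised */
           vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                   /* vma cannot be freed, nor its virtual layout changed */
           }
           i_mmap_unlock_read(mapping);
   }
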
The first thing any of these locks achieve is to stabilise the VMA within the
MM tree. That is, guaranteeing that the VMA object will not be deleted from
under you nor modified (except for some specific fields described below).

Stabilising a VMA also keeps the address space described by it around.
If you want to read VMA metadata fields or just keep the VMA stable, you must
do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock`
  (or a suitable variant), unlocking it with a matching
  :c:func:`!mmap_read_unlock` when you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries
  to acquire the lock atomically so might fail, in which case fall-back logic
  is required to instead obtain an mmap read lock if this returns
  :c:macro:`!NULL`, *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

If you want to write VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:
* Obtain an mmap write lock at the MM granularity via
  :c:func:`!mmap_write_lock` (or a suitable variant), unlocking it with a
  matching :c:func:`!mmap_write_unlock` when you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish
  to modify, which will be released automatically when
  :c:func:`!mmap_write_unlock` is called.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire
then release an RCU lock to lookup the VMA for you).
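
Putting the read side together, a minimal sketch (the surrounding function is
hypothetical) of the optimistic VMA read lock with the mmap read lock fallback
described above:

.. code-block:: c

   static void inspect_vma_at(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           vma = lock_vma_under_rcu(mm, addr);     /* optimistic - may fail */
           if (vma) {
                   /* ... read VMA metadata, VMA is stable ... */
                   vma_end_read(vma);
                   return;
           }

           /* Fall back to the coarser mmap read lock. */
           mmap_read_lock(mm);
           vma = vma_lookup(mm, addr);
           if (vma) {
                   /* ... read VMA metadata, VMA is stable ... */
           }
           mmap_read_unlock(mm);
   }
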
This design constrains the impact of writers on readers, as a writer can
interact with one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrent
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap
             read lock, attempting to do the reverse is invalid as it can
             result in deadlock - if another task already holds an mmap write
             lock and attempts to acquire a VMA write lock that will deadlock
             on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with
          other readers and write locks exclusive against all others holding
          the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose,
which makes it easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as
          these are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================== ============================
   Field                               Description                                Write lock
   =================================== ========================================== ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the         mmap write, VMA write,
                                       mapping is file-backed, to place the VMA   i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the        mmap write, VMA write,
                                       interval tree if the VMA is file-backed.   i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d      mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by         When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to     setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                 mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This   When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in.  setting :c:macro:`NULL`:
                                                                                  mmap write, VMA write,
                                                                                  anon_vma write.
   =================================== ========================================== ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related
:c:struct:`!struct anon_vma` objects and the :c:struct:`!struct anon_vma` in
which folios mapped exclusively to this VMA should reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

We won't speak exhaustively on the subject but broadly speaking, page tables
map virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
          five the kernel cleverly 'folds' page table levels, that is, stubs
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There
   is also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes
   the caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.
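
As an example of zapping, a hedged sketch of what a filesystem does on file
truncation (:c:func:`!unmap_mapping_range` is a real interface; the
:c:macro:`!inode` and :c:macro:`!newsize` context is assumed):

.. code-block:: c

   /*
    * Zap every leaf PTE mapping the truncated file range in every VMA
    * which maps it, leaving the page tables themselves intact. A
    * holelen of 0 means 'to the end of the file', and even_cows == 1
    * also zaps CoW'd private mappings.
    */
   unmap_mapping_range(inode->i_mapping, round_up(newsize, PAGE_SIZE), 0, 1);
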
Traversing and zapping ranges can be performed holding any one of the locks described in the terminology section above - that is the mmap lock, the VMA lock or either of the reverse mapping locks.
That is - as long as you keep the relevant VMA stable - you are good to go ahead and perform these operations on page tables (though internally, kernel operations that perform writes also acquire internal page table locks to serialise - see the page table implementation detail section for more details).
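
For instance, a traversal sketch using the real pagewalk API
(:c:func:`!walk_page_range` and :c:struct:`!struct mm_walk_ops`); the callback
and surrounding function are hypothetical:

.. code-block:: c

   #include <linux/pagewalk.h>

   static int count_present_pte(pte_t *pte, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
   {
           unsigned long *count = walk->private;

           if (pte_present(ptep_get(pte)))
                   (*count)++;
           return 0;
   }

   static const struct mm_walk_ops count_ops = {
           .pte_entry = count_present_pte,
   };

   static unsigned long count_present(struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
   {
           unsigned long count = 0;

           mmap_read_lock(mm);     /* keep the VMAs in the range stable */
           walk_page_range(mm, start, end, &count_ops, &count);
           mmap_read_unlock(mm);

           return count;
   }
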
.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When installing page table entries, the mmap or VMA lock must be held to keep
the VMA stable. We explore why this is in the page table locking details
section below.

Freeing page tables is an entirely internal memory management operation and has special requirements (see the page freeing section below for more details).

.. warning:: When freeing page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to
             be accessible and span the specified range.

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them, and the aforementioned locking rules do not apply
- for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been
acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often, this usually suffices; however any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.
We also permit a truly unusual case: the traversal of non-VMA ranges in
userland, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug
purposes even if they are highly unusual (possibly architecture-specific) and
are not backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with :c:func:`!munmap`, and thus an mmap
write lock is required.

As we have multiple locks across the kernel which may or may not be taken at
the same time as explicit mm or VMA locks, we have to be wary of lock
inversion, and the order in which locks are acquired and released becomes very
important.

.. note:: Lock inversion occurs when two threads need to acquire multiple
          locks, but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to
          acquire lock B, while thread 2 holds lock B and tries to acquire
          lock A.

          Both threads are now deadlocked on each other. However, had they
          attempted to acquire locks in the same order, one would have waited
          for the other to complete its work and no deadlock would have
          occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                      i_pages lock (widely used)
                        lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                      in arch-dependent flush_dcache_mmap_lock,
                                      within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top
of :c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                (truncate_pagecache)
    ->private_lock              (__free_pte->block_dirty_folio)
      ->swap_lock               (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock           (acquired by fs in truncate path)
      ->i_mmap_rwsem            (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock   (various, mainly in memory.c)
        ->i_pages lock          (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock           (filemap_fault)
      ->lock_page               (filemap_fault, access_process_vm)

  ->i_rwsem                     (generic_perform_write)
    ->mmap_lock                 (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                     (fs/fs-writeback.c)
    ->i_pages lock              (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock             (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock     (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                 (try_to_unmap_one)
    ->private_lock              (try_to_unmap_one)
    ->i_pages lock              (try_to_unmap_one)
    ->lruvec->lru_lock          (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock          (check_pte_range->folio_isolate_lru)
    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
    ->inode->i_lock             (zap_pte_range->set_page_dirty)
    ->private_lock              (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since the time of writing of this document.

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

.. note:: This section explores page table locking requirements for page
          tables encompassed by a VMA. See the above section on non-VMA page
          table traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* *Higher level page table locks* - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.
* *Fine-grained page table locks* - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a traversal of page tables, sometimes no such locks
are taken. However, at the PTE level, at least concurrent page table deletion
must be prevented (using RCU) and the page table must be mapped into high
memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately* atomic.
  See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the
  VMA stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when
             unmapping VMAs, :c:func:`!vms_clear_ptes` has a window of time
             between zapping (via :c:func:`!unmap_vmas`) and freeing page
             tables (via :c:func:`!free_pgtables`), where the VMA is still
             visible in the rmap tree. :c:func:`!free_pgtables` assumes that
             the zap has already been performed and removes PTEs
             unconditionally (along with all other page tables in the freed
             range), so installing new PTE entries could leak memory and also
             cause other unexpected and dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.

  So accessing PTE-level page tables requires at least holding an RCU read
  lock; but that only suffices for readers that can tolerate racing with
  concurrent page table updates such that an empty PTE is observed (in a page
  table that has actually already been detached and marked for RCU freeing)
  while another new page table has been installed in the same location and
  filled with entries. Writers normally need to take the PTE lock and
  revalidate that the PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the pmd entry
  still meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock`
or :c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.

See the comment on :c:func:`!pte_offset_map_lock`.
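
A minimal sketch of the canonical PTE access pattern under these rules (the
enclosing context - :c:macro:`!mm`, :c:macro:`!pmd` and :c:macro:`!addr` - is
assumed):

.. code-block:: c

   spinlock_t *ptl;
   pte_t *pte;

   pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
   if (!pte)
           return -EAGAIN; /* PTE table vanished (e.g. THP collapse) - retry */

   /* ... examine or modify the entry via *pte under the PTE lock ... */

   pte_unmap_unlock(pte, ptl);
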

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations run in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a read must be performed once and only once or not depends on the architecture (for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the
compiler does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers
like :c:func:`!gup_fast_pte_range`), must very carefully interact with page
table entries, using functions such as :c:func:`!ptep_get_lockless` and
equivalents at higher page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.
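
A hedged sketch of these atomicity rules in combination, assuming the PTE lock
is held and :c:macro:`!mm`, :c:macro:`!addr` and :c:macro:`!ptep` are provided
by the surrounding (hypothetical) context:

.. code-block:: c

   pte_t entry = ptep_get(ptep);   /* READ_ONCE - read once and only once */

   if (pte_present(entry)) {
           /*
            * The MMU may be setting accessed/dirty bits concurrently, so
            * if we care about the old value we must clear the entry with
            * an atomic operation, not a plain read followed by a write.
            */
           pte_t old = ptep_get_and_clear(mm, addr, ptep);

           /* ... act upon the flags captured in old ... */
   }
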

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an mmap or VMA lock in read or write mode (see the warning in the locking rules section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
          :c:func:`!pud_lockptr` in turn, however at the time of writing it
          ultimately references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the PTE page table lock must be acquired whenever we want stable and exclusive access to entries contained within a PTE, especially when we wish to modify them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained
within the :c:struct:`!struct ptdesc` associated with the physical PTE page.
The lock must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
          :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE
          stable but for brevity we do not explore this. See the comment for
          :c:func:`!pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page
faulting or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the
table above is empty and, if so, only then acquire the page table lock,
checking again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.
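
A simplified sketch of this pattern (modelled loosely on
:c:func:`!__pmd_alloc`; error handling and accounting are elided):

.. code-block:: c

   if (pud_none(READ_ONCE(*pud))) {        /* optimistic lockless check */
           pmd_t *new = pmd_alloc_one(mm, addr);

           spin_lock(&mm->page_table_lock);
           if (!pud_present(*pud))         /* re-check under the lock */
                   pud_populate(mm, pud, new);
           else                            /* lost the race - discard ours */
                   pmd_free(mm, new);
           spin_unlock(&mm->page_table_lock);
   }
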
At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might
have eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then again checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant care. There must be no way that page tables designated for removal can be traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the
:c:member:`!struct address_space->i_mmap` interval trees) can have its page
tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which
assumes either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it's important
that no new ones overlap these or any route remain to permit access to
addresses within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed
between the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries
are cleared without page table locks (in the :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, and :c:func:`!pmd_clear`
functions).

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above it as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of
          care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is the
mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the
rmap side of lock acquisition, invoked ultimately by
:c:func:`!move_page_tables`.

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a
competing write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds the mmap read lock,
:c:func:`!vma_start_read_locked` and :c:func:`!vma_start_read_locked_nested`
can be used. These functions do not fail due to lock contention but the caller
should still check their return values in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
where a VMA is about to be modified; unlike :c:func:`!vma_start_read` the lock
is always acquired. An mmap write lock **must** be held for the duration of the
VMA write lock; releasing or downgrading the mmap write lock also releases the
VMA write lock so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is
temporarily modified so that readers can detect the presence of a writer. The
reference counter is restored once the vma sequence number used for
serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the
use of the heavily contended mmap lock. It is implemented using a combination
of a reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon
VMA read lock acquisition. Once acquired, the RCU lock can be released as it is
only required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu`
which is the interface a user should use.

Writing requires the mmap lock to be write-locked and the VMA lock to be
acquired via :c:func:`!vma_start_write`, however the write lock is released by
the termination or downgrade of the mmap write lock so no
:c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which
are used in order to reduce complexity, especially for operations which
write-lock multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.
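
Conceptually (this is an illustrative sketch only - the real check lives in
:c:func:`!vma_start_write` and friends and uses seqcount helpers and the
:c:member:`!vma.vm_refcnt` bits described below; field spellings here follow
the description above rather than any particular kernel version):

.. code-block:: c

   /* A VMA is write-locked iff its sequence count matches the mm's. */
   static bool vma_is_write_locked(struct vm_area_struct *vma)
   {
           return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
   }
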
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release all VMA write locks contained within the mmap at the same
time.

Since the mmap write lock is exclusive against others who hold it, the
automatic release of any VMA locks on its release makes sense, as you would
never want to keep VMAs locked across entirely separate write operations. It
also maintains correct lock ordering.

Each time a VMA read lock is acquired, we increment the
:c:member:`!vma.vm_refcnt` reference counter and check that the sequence count
of the VMA does not match that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.

If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will
sleep until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number
will indicate the VMA's write-locked state until the mmap write lock is
dropped or downgraded.

This clever combination of a reference counter and sequence count allows for fast RCU-based per-VMA lock acquisition (especially on page fault, though utilised elsewhere) with minimal complexity around lock ordering.
When an mmap write lock is held one has exclusive access to resources within the mmap (with the usual caveats about requiring VMA write locks to avoid races with tasks holding VMA read locks).

It is then possible to downgrade from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.
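
A short sketch of the downgrade pattern (all functions shown are real
interfaces):

.. code-block:: c

   mmap_write_lock(mm);
   /* ... modify VMAs; vma_start_write() write-locks those we touch ... */

   mmap_write_downgrade(mm);       /* all VMA write locks released here */

   /* The address space is still stable, but only reads are permitted. */

   mmap_read_unlock(mm);
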
An interesting consequence of this is that downgraded locks are exclusive against any other task possessing a downgraded lock (since a racing task would have to acquire a write lock first to downgrade it, and the downgraded lock prevents a new write lock from being obtained until the original lock is released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion throws up additional complexities in that we cannot permit
there to be racing page faults; as a result we invoke :c:func:`!vma_start_write`
to prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.

.. kernel-doc:: include/linux/mmap_lock.h