Documentation/arch/x86/tlb.rst
.. SPDX-License-Identifier: GPL-2.0
When the kernel unmaps or modified the attributes of a range of memory, it has two choices:
Which method to do depends on a few things:
There is obviously no way the kernel can know all these things, especially the contents of the TLB during a given flush. The sizes of the flush will vary greatly depending on the workload as well. There is essentially no "right" point to choose.
You may be doing too many individual invalidations if you see the invlpg instruction (or instructions near it) show up high in profiles. If you believe that individual invalidations being called too often, you can lower the tunable::
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
This will cause us to do the global flush for more cases. Lowering it to 0 will disable the use of the individual flushes. Setting it to 1 is a very conservative setting and it should never need to be 0 under normal circumstances.
Despite the fact that a single individual flush on x86 is guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full flushes. THP is treated exactly the same as normal memory.
You might see invlpg inside of flush_tlb_mm_range() show up in profiles, or you can use the trace_tlb_flush() tracepoints. to determine how long the flush operations are taking.
Essentially, you are balancing the cycles you spend doing invlpg with the cycles that you spend refilling the TLB later.
You can measure how expensive TLB refills are by using performance counters and 'perf stat', like this::
perf stat -e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
That works on an IvyBridge-era CPU (i5-3320M). Different CPUs may have differently-named counters, but they should at least be there in some form. You can use pmu-tools 'ocperf list' (https://github.com/andikleen/pmu-tools) to find the right counters for a given CPU.
.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation" says: "One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes."