lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250604180308.137116-1-lorenzo.stoakes@oracle.com>
Date: Wed,  4 Jun 2025 19:03:08 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeel.butt@...ux.dev>,
        Jonathan Corbet <corbet@....net>, Jann Horn <jannh@...gle.com>,
        Qi Zheng <zhengqi.arch@...edance.com>, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: [PATCH v2] docs/mm: expand vma doc to highlight pte freeing, non-vma traversal

The process addresses documentation already contains a great deal of
information about mmap/VMA locking and page table traversal and
manipulation.

However it waves it hands about non-VMA traversal. Add a section for this
and explain the caveats around this kind of traversal.

Additionally, commit 6375e95f381e ("mm: pgtable: reclaim empty PTE page in
madvise(MADV_DONTNEED)") caused zapping to also free empty PTE page
tables. Highlight this.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
---
v2:
* Replaced references to walk_page_range_novma() with
  walk_kernel_page_table_range() and walk_page_range_debug() as necessary.
* Dropped references to v6.14 and the commit that introduces PTE reclaim on
  zap as per Jon.
* Added additional reference about freeing PTE page tables tables when
  zapping under RCU.
* Clarified kernel page table locking as per Jann.
* Dropped the warning about zapping and PTE page table removal - it was too
  vague anyway, but it seems that VMA lock acquisition in this scenario is
  not required.
* I will address the issues Jon raised re: markup in follow up series.

v1:
https://lore.kernel.org/all/20250602210710.106159-1-lorenzo.stoakes@oracle.com/

 Documentation/mm/process_addrs.rst | 54 ++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index e6756e78b476..be49e2a269e4 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -303,7 +303,9 @@ There are four key operations typically performed on page tables:
 1. **Traversing** page tables - Simply reading page tables in order to traverse
    them. This only requires that the VMA is kept stable, so a lock which
    establishes this suffices for traversal (there are also lockless variants
-   which eliminate even this requirement, such as :c:func:`!gup_fast`).
+   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
+   also a special case of page table traversal for non-VMA regions which we
+   consider separately below.
 2. **Installing** page table mappings - Whether creating a new mapping or
    modifying an existing one in such a way as to change its identity. This
    requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
@@ -335,15 +337,13 @@ ahead and perform these operations on page tables (though internally, kernel
 operations that perform writes also acquire internal page table locks to
 serialise - see the page table implementation detail section for more details).

+.. note:: We free empty PTE tables on zap under the RCU lock - this does not
+          change the aforementioned locking requirements around zapping.
+
 When **installing** page table entries, the mmap or VMA lock must be held to
 keep the VMA stable. We explore why this is in the page table locking details
 section below.

-.. warning:: Page tables are normally only traversed in regions covered by VMAs.
-             If you want to traverse page tables in areas that might not be
-             covered by VMAs, heavier locking is required.
-             See :c:func:`!walk_page_range_novma` for details.
-
 **Freeing** page tables is an entirely internal memory management operation and
 has special requirements (see the page freeing section below for more details).

@@ -355,6 +355,44 @@ has special requirements (see the page freeing section below for more details).
              from the reverse mappings, but no other VMAs can be permitted to be
              accessible and span the specified range.

+Traversing non-VMA page tables
+------------------------------
+
+We've focused above on traversal of page tables belonging to VMAs. It is also
+possible to traverse page tables which are not represented by VMAs.
+
+Kernel page table mappings themselves are generally managed but whatever part of
+the kernel established them and the aforementioned locking rules do not apply -
+for instance vmalloc has its own set of locks which are utilised for
+establishing and tearing down page its page tables.
+
+However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
+function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
+kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.
+
+If an operation requires exclusive access, a write lock is used, but if not, a
+read lock suffices - we assert only that at least a read lock has been acquired.
+
+Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
+down all that often - this usually suffices, however any caller of this
+functionality must ensure that any additionally required locks are acquired in
+advance.
+
+We also permit a truly unusual case is the traversal of non-VMA ranges in
+**userland** ranges, as provided for by :c:func:`!walk_page_range_debug`.
+
+This has only one user - the general page table dumping logic (implemented in
+:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
+even if they are highly unusual (possibly architecture-specific) and are not
+backed by a VMA.
+
+We must take great care in this case, as the :c:func:`!munmap` implementation
+detaches VMAs under an mmap write lock before tearing down page tables under a
+downgraded mmap read lock.
+
+This means such an operation could race with this, and thus an mmap **write**
+lock is required.
+
 Lock ordering
 -------------

@@ -461,6 +499,10 @@ Locking Implementation Details
 Page table locking details
 --------------------------

+.. note:: This section explores page table locking requirements for page tables
+          encompassed by a VMA. See the above section on non-VMA page table
+          traversal for details on how we handle that case.
+
 In addition to the locks described in the terminology section above, we have
 additional locks dedicated to page tables:

--
2.49.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ