Message-ID: <f08104bc-8ef2-4f6d-8d23-e5a087595a40@lucifer.local>
Date: Tue, 3 Jun 2025 11:45:21 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Jonathan Corbet <corbet@....net>,
Qi Zheng <zhengqi.arch@...edance.com>, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] docs/mm: expand vma doc to highlight pte freeing,
non-vma traversal
On Tue, Jun 03, 2025 at 12:25:36AM +0200, Jann Horn wrote:
> On Mon, Jun 2, 2025 at 11:07 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > The process addresses documentation already contains a great deal of
> > information about mmap/VMA locking and page table traversal and
> > manipulation.
> >
> > However it waves it hands about non-VMA traversal. Add a section for this
> > and explain the caveats around this kind of traversal.
> >
> > Additionally, commit 6375e95f381e ("mm: pgtable: reclaim empty PTE page in
> > madvise(MADV_DONTNEED)") caused zapping to also free empty PTE page
> > tables. Highlight this and reference how this impacts ptdump non-VMA
> > traversal of userland mappings.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > ---
> > Documentation/mm/process_addrs.rst | 58 ++++++++++++++++++++++++++----
> > 1 file changed, 52 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > index e6756e78b476..83166c2b47dc 100644
> > --- a/Documentation/mm/process_addrs.rst
> > +++ b/Documentation/mm/process_addrs.rst
> > @@ -303,7 +303,9 @@ There are four key operations typically performed on page tables:
> > 1. **Traversing** page tables - Simply reading page tables in order to traverse
> > them. This only requires that the VMA is kept stable, so a lock which
> > establishes this suffices for traversal (there are also lockless variants
> > - which eliminate even this requirement, such as :c:func:`!gup_fast`).
> > + which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
> > + also a special case of page table traversal for non-VMA regions which we
> > + consider separately below.
> > 2. **Installing** page table mappings - Whether creating a new mapping or
> > modifying an existing one in such a way as to change its identity. This
> > requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
> > @@ -335,15 +337,14 @@ ahead and perform these operations on page tables (though internally, kernel
> > operations that perform writes also acquire internal page table locks to
> > serialise - see the page table implementation detail section for more details).
> >
> > +.. note:: Since v6.14 and commit 6375e95f381e ("mm: pgtable: reclaim empty PTE
> > + page in madvise (MADV_DONTNEED)"), we now also free empty PTE tables
> > + on zap. This does not change zapping locking requirements.
> > +
> > When **installing** page table entries, the mmap or VMA lock must be held to
> > keep the VMA stable. We explore why this is in the page table locking details
> > section below.
> >
> > -.. warning:: Page tables are normally only traversed in regions covered by VMAs.
> > - If you want to traverse page tables in areas that might not be
> > - covered by VMAs, heavier locking is required.
> > - See :c:func:`!walk_page_range_novma` for details.
> > -
> > **Freeing** page tables is an entirely internal memory management operation and
> > has special requirements (see the page freeing section below for more details).
> >
> > @@ -355,6 +356,47 @@ has special requirements (see the page freeing section below for more details).
> > from the reverse mappings, but no other VMAs can be permitted to be
> > accessible and span the specified range.
> >
> > +Traversing non-VMA page tables
> > +------------------------------
> > +
> > +We've focused above on traversal of page tables belonging to VMAs. It is also
> > +possible to traverse page tables which are not represented by VMAs.
> > +
> > +Primarily this is used to traverse kernel page table mappings. In which case one
> > +must hold an mmap **read** lock on the :c:macro:`!init_mm` kernel instantiation
> > +of the :c:struct:`!struct mm_struct` metadata object, as performed in
> > +:c:func:`walk_page_range_novma`.
>
> My understanding is that kernel page tables are walked with no MM
> locks held all the time. See for example:
>
> - vmalloc_to_page()
> - vmap()
> - KASAN's shadow_mapped()
> - apply_to_page_range() called from kasan_populate_vmalloc() or
> arm64's set_direct_map_invalid_noflush()
I explicitly mention vmalloc :) and the others are, I guess, the
'exceptions'.
Perhaps the better way of phrasing this is to say:
'when using this interface, we stabilise on the mmap read lock; however,
there are no guarantees anything else in the kernel will behave like this - we
rely on these page tables not being freed. In cases where they might be,
make sure you have the right locks'.
Which is _kinda_ what it's already saying, but maybe not clearly enough...
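To make concrete what I mean, something like the below - a rough
kernel-style sketch only, and dump_kernel_range() is a name I'm making up
here, not anything in-tree:

```c
/*
 * Illustrative sketch: stabilise non-VMA traversal of *kernel* page
 * tables by holding the init_mm mmap read lock around the walk, as
 * walk_page_range_novma() expects.
 */
static int dump_kernel_range(unsigned long start, unsigned long end,
			     const struct mm_walk_ops *ops, void *priv)
{
	int ret;

	/*
	 * This only excludes page table freeing that is itself performed
	 * under the init_mm mmap lock - anything freeing kernel page
	 * tables outside that discipline needs its own locking.
	 */
	mmap_read_lock(&init_mm);
	ret = walk_page_range_novma(&init_mm, start, end, ops,
				    NULL /* pgd */, priv);
	mmap_read_unlock(&init_mm);

	return ret;
}
```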
>
> This is possible because kernel-internal page tables are used for
> allocations managed by kernel-internal users, and so things like the
> lifetimes of page tables can be guaranteed by higher-level logic.
> (Like "I own a vmalloc allocation in this range, so the page tables
> can't change until I call vfree().")
Right.
>
> The one way in which I think this is currently kinda yolo/broken is
> that vmap_try_huge_pud() can end up freeing page tables via
> pud_free_pmd_page(), while holding no MM locks AFAICS, so that could
> race with the ptdump debug logic such that ptdump walks into freed
> page tables?
But those mappings would be kernel mappings? How could ptdump walk into
those?
>
> I think the current rules for kernel page tables can be summarized as
> "every kernel subsystem can make up its own rules for its regions of
> virtual address space", which makes ptdump buggy because it can't
> follow the different rules of all subsystems; and we should probably
> change the rules to "every kernel subsystem can make up its own rules
> except please take the init_mm's mmap lock when you delete page
> tables".
>
> > +This is generally sufficient to preclude other page table walkers (excluding
> > +vmalloc regions and memory hot plug) as the intermediate kernel page tables are
> > +not usually freed.
> > +
> > +For cases where they might be then the caller has to acquire the appropriate
> > +additional locks.
> > +
> > +The truly unusual case is the traversal of non-VMA ranges in **userland**
> > +ranges.
> > +
> > +This has only one user - the general page table dumping logic (implemented in
> > +:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
> > +even if they are highly unusual (possibly architecture-specific) and are not
> > +backed by a VMA.
> > +
> > +We must take great care in this case, as the :c:func:`!munmap` implementation
> > +detaches VMAs under an mmap write lock before tearing down page tables under a
> > +downgraded mmap read lock.
> > +
> > +This means such an operation could race with this, and thus an mmap **write**
> > +lock is required.
> > +
> > +.. warning:: A racing zap operation is problematic if it is performed without an
> > + exclusive lock held - since v6.14 and commit 6375e95f381e PTEs may
> > + be freed upon zap, so if this occurs the traversal might encounter
> > + the same issue seen due to :c:func:`!munmap`'s use of a downgraded
> > + mmap lock.
> > +
> > + In this instance, additional appropriate locking is required.
>
> (I think we should take all the vma locks in that ptdump code and get
> rid of this weird exception instead of documenting it.)
We really need to be sure there aren't some weird architectures doing
weird things, or circumstances where this is meaningful.
I mean, people went to great lengths to make this possible; I find it hard
to believe there aren't some _weird_ real-world use cases.
At any rate my soon-to-be-coming patch will separate the kernel use from
this weirdo use so we can fully isolate it.
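For the userland non-VMA case the shape of the separation I have in mind
is roughly the below - again just a sketch, dump_user_range() is an
illustrative name, not proposed code:

```c
/*
 * Illustrative sketch: non-VMA traversal of *userland* ranges
 * (ptdump-style) must hold the mmap *write* lock, so it cannot race
 * munmap's page table teardown under a downgraded read lock, nor
 * PTE table freeing on zap (since commit 6375e95f381e).
 */
static int dump_user_range(struct mm_struct *mm, unsigned long start,
			   unsigned long end,
			   const struct mm_walk_ops *ops, void *priv)
{
	int ret;

	mmap_write_lock(mm);
	ret = walk_page_range_novma(mm, start, end, ops,
				    NULL /* pgd */, priv);
	mmap_write_unlock(mm);

	return ret;
}
```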