[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <2b0fea4c-6b01-4bac-8ec8-01fcce4fa12e@lucifer.local>
Date: Mon, 18 Nov 2024 12:35:35 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Jonathan Corbet <corbet@....net>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Alice Ryhl <aliceryhl@...gle.com>,
Boqun Feng <boqun.feng@...il.com>,
Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Hillf Danton <hdanton@...a.com>, Qi Zheng <zhengqi.arch@...edance.com>,
SeongJae Park <sj@...nel.org>, Bagas Sanjaya <bagasdotme@...il.com>,
linux-mm@...ck.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, Matteo Rizzo <matteorizzo@...gle.com>
Subject: Re: [PATCH] docs/mm: add more warnings around page table access
On Fri, Nov 15, 2024 at 08:04:52PM +0100, Jann Horn wrote:
> On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote:
> > > Make it clearer that holding the mmap lock in read mode is not enough
> > > to traverse page tables, and that just having a stable VMA is not enough
> > > to read PTEs.
> > >
> > > Suggested-by: Matteo Rizzo <matteorizzo@...gle.com>
> > > Signed-off-by: Jann Horn <jannh@...gle.com>
> >
> > Have some queries before we move forward so would like a little more
> > clarification/perhaps putting some extra meat on the bones first.
> >
> > Broadly very glad you have done this however so it's just sorting details
> > first! :>)
> >
> > > ---
> > > @akpm: Please don't put this in your tree before Lorenzo has replied.
> > >
> > > @Lorenzo:
> > > This is intended to go on top of your documentation patch.
> > > If you think this is a sensible change, do you prefer to squash it into
> > > your patch or do you prefer having akpm take this as a separate patch?
> > > IDK what works better...
> >
> > I think a new patch is better, as I'd like the original to settle down now
> > and the whole point of this doc is that it's a living thing that many
> > people can contribute to, update, etc.
> >
> > For instance, Suren is updating as part of one of his series to correct
> > things that he changes in that series, which is really nice.
> >
> > > ---
> > > Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
> > > 1 file changed, 19 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
> > > --- a/Documentation/mm/process_addrs.rst
> > > +++ b/Documentation/mm/process_addrs.rst
> > > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
> > > keep the VMA stable. We explore why this is in the page table locking details
> > > section below.
> > >
> > > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for
> > > + traversing page tables; you must also ensure that a VMA exists that
> > > + covers the range being accessed.
> >
> > Hm, but we say later we don't need _any_ locks for traversal, and here we
> > say we need mmap read lock. Do you mean installing page table entries?
> >
> > Or do you mean to say, that if you don't span a VMA, you must acquire a
> > write lock at least to preclude this?
>
> Yeah, this is what I meant with the "you must also ensure that a VMA
> exists" part. (Context is that I was looking at some non-upstream code
> that was trying to do exactly this - take a userspace-supplied address
> and walk down the page tables at that address. Upstream also once had
> a UAF in pagemap_walk() because walk_page_range() was used while
> holding only the mmap read lock, see
> <https://project-zero.issues.chromium.org/42451485>.)
>
> > This seems quite unclear.
> >
> > I kind of didn't want to touch on the horrors of fiddling about without a
> > VMA, so I'd rather this very clearly say something like 'it is unusual to
> > manipulate page tables wihch are not spanned by a VMA, and there are
> > special requirements for this operation' etc. et.c otherwise this just adds
> > more noise and confusion I think.
>
> I guess maybe we could replace this entire warning block with
> something like this?
>
> ".. warning:: Page tables are normally only traversed in regions
> covered by VMAs. If you want to traverse page tables in areas that
> might not be covered by VMAs, heavier locking is required. See
> :c:func:`!walk_page_range_novma` for details."
Yes I prefer this.
>
> And walk_page_range_novma() already has a comment block talking about
> why an mmap read lock isn't enough to traverse user virtual address
> space outside VMAs.
>
> > > + This ensures you can't race with concurrent page table removal
> > > + which happens with the mmap lock in read mode, in regions whose
> > > + VMAs are no longer present in the VMA tree.
> > > +
> > > + (Alternatively, the mmap lock can be taken in write mode, but that
> > > + is heavy-handed and almost never the right choice.)
> >
> > You kind of need to expand on why that is I think!
> >
> > > +
> > > **Freeing** page tables is an entirely internal memory management operation and
> > > has special requirements (see the page freeing section below for more details).
> > >
> > > @@ -450,6 +460,9 @@ the time of writing of this document.
> > > Locking Implementation Details
> > > ------------------------------
> > >
> > > +.. warning:: Locking rules for PTE-level page tables are very different from
> > > + locking rules for page tables at other levels.
> > > +
> > > Page table locking details
> > > --------------------------
> > >
> > > @@ -470,8 +483,12 @@ additional locks dedicated to page tables:
> > > These locks represent the minimum required to interact with each page table
> > > level, but there are further requirements.
> > >
> > > -Importantly, note that on a **traversal** of page tables, no such locks are
> > > -taken. Whether care is taken on reading the page table entries depends on the
> > > +Importantly, note that on a **traversal** of page tables, sometimes no such
> > > +locks are taken. However, at the PTE level, at least concurrent page table
> > > +deletion must be prevented (using RCU) and the page table must be mapped into
> > > +high memory, see below.
> >
> > Ugh I really do hate that we have to think about high memory. I'd like to
> > sort of deny it exists. But I suppose that's not an option.
>
> (FWIW from a quick look I think only arm and x86 actually have this behavior.)
>
> I mean... we could try to just make userspace page tables live in
> non-high-memory based on the assumption that nobody nowadays is going
> to run a 32-bit kernel on a device with significantly more than 4G or
> 8G or so of RAM, and probably most memory is used for stuff like
> anon/file pages, not for page tables... but I don't think it actually
> matters much for code complexity now that we need RCU for page table
> access anyway.
Right, I just am not a fan of 32-bit kernels :P not your fault!
>
> > As for the RCU thing, I guess this is why pte_offset_map_lock() is taking
> > it? Maybe worth mentioning something there or updating that 'interestingly'
> > block... :>)
>
> Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and
> such which don't lock the page table.
Could you update that 'interestingly' block then also?
>
> > Or am I mistaken? I wasn't aware of this requirement, is this sort of
> > implied by the gup_fast() IRQ disabling stuff?
>
> This is to protect against khugepaged/MADV_COLLAPSE deleting page
> tables while holding only an rmap lock (in read mode), the pmd lock,
> and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially
> in another process) can (via retract_page_tables()) remove the page
> table while we're traversing it, but the page table will only actually
> be freed via RCU (thanks to pte_free_defer), and so this doesn't cause
> use-after-free. pte_offset_map_lock() ensures that on success, it has
> locked the current page table (holding the pte lock prevents the page
> table being removed in the future, and rechecking the pmd entry
> ensures it hasn't already been removed), so we never end up installing
> PTEs into page tables that have already been removed. pte_offset_map()
> does not ensure that you always see the latest page table; it could be
> that by the time you're looking at a PTE inside it, the page table
> you're looking at has been removed, and a new page table has been
> installed in its place and populated with PTEs. But removed page
> tables are always empty, so the only possible consequence of such a
> race is that you wrongly think that there is no PTE at a given
> address.
It'd be good to include this detail in the doc also.
>
> The fourth paragraph of the comment block above
> __pte_offset_map_lock() sorta explains this.
>
> > Please expand :)
Thanks!
Powered by blists - more mailing lists