linux-kernel - Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <rnanfpzs6fmojyeaowt65mug6xekorrkeefvn3b4zc7buunzhc@rrzcbhkrjdv4>
Date: Thu, 19 Dec 2024 13:18:23 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>, akpm@...ux-foundation.org,
        willy@...radead.org, lorenzo.stoakes@...cle.com, mhocko@...e.com,
        vbabka@...e.cz, hannes@...xchg.org, mjguzik@...il.com,
        oliver.sang@...el.com, mgorman@...hsingularity.net, david@...hat.com,
        peterx@...hat.com, oleg@...hat.com, dave@...olabs.net,
        paulmck@...nel.org, brauner@...nel.org, dhowells@...hat.com,
        hdanton@...a.com, hughd@...gle.com, lokeshgidra@...gle.com,
        minchan@...gle.com, jannh@...gle.com, shakeel.butt@...ux.dev,
        souravpanda@...gle.com, pasha.tatashin@...een.com,
        klarasmodin@...il.com, corbet@....net, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        kernel-team@...roid.com
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a
 reference count

* Peter Zijlstra <peterz@...radead.org> [241219 12:42]:
> On Thu, Dec 19, 2024 at 12:16:45PM -0500, Liam R. Howlett wrote:
> 
> > Well, hold on - it is taken out of the rmap/anon vma chain here.  It is
> > completely unhooked except the vma tree at this point.  We're not adding
> > complexity, we're dealing with it.
> 
> So I'm not entirely sure I understand the details here -- this is again
> about being able to do rollback when things fail?

no, it's so that things can be ready for the replacement that won't
cause issues during insertion.  We can't have two VMAs in the rmap for
the same area, for instance.

I would like to be able to undo things, but if there's a file that's
closed then we can't undo that.

> 
> There is comment above the vms_clean_up_area() call in __mmap_prepare(),
> but its not making sense atm.
> 
> > >Is there anything that would prevent a concurrent gup_fast() from
> > > > doing the same -- touch a cleared PTE?
> > 
> > Where does gup_fast() install PTEs?  Doesn't it bail once a READ_ONCE()
> > on any level returns no PTE?
> 
> I think you're right, GUP doesn't, but any 'normal' page-table walker
> will.

Normal page-table walkers will be locked out by the page table lock
while the area is cleared, it cannot be re-walked after that because of
the vma lock (except rcu walkers, which is why these new locks are
needed).  Any direct page table walking won't be a problem because we've
removed any way to get to it - I think?

> 
> > > > AFAICT two threads, one doing overlapping mmap() and the other doing
> > > > gup_fast() can result in exactly this scenario.
> > 
> > The mmap() call will race with the gup_fast(), but either the nr_pinned
> > will be returned from gup_fast() before vms_clean_up_area() removes the
> > page table (or any higher level), or gup_fast() will find nothing.
> 
> Agreed.
> 
> > > > If we don't care about the GUP case, when I'm thinking we should not
> > > > care about the lockless RCU case either.
> > > 
> > > Also, at this point we'll just fail to find a page, and that is nothing
> > > special. The problem with accessing an unmapped VMA is that the
> > > page-table walk will instantiate page-tables.
> > 
> > I think there is a problem if we are reinstalling page tables on a vma
> > that's about to be removed.  I think we are avoiding this with our
> > locking though?
> 
> So this is purely about the overlapping part, right? We need to remove
> the old pages, install the new mapping and have new pages populate the
> thing.

No.  If we allow rcu readers to re-fault, we may hit a race where the
page fault has found the vma, but doesn't fault things in before the
ptes are removed or the vma is freed/reused.  Today that won't happen
because we mark the vma as going to be removed before the tree is
changed, so anyone reaching the vma will see it's not safe.

If we want to use different locking strategies for munmap() vs
MAP_FIXED, then we'd need to be sure the vma that is being freed is
isolated for removal and all readers are done before freeing.  I think
this is what you are trying to convey to Suren?

I don't like the idea of another locking strategy in munmap() vs
MAP_FIXED, especially if you think about who would be doing this..
basically no one.  There isn't a sane (legitimate) application that's
going to try and page fault in something that's being removed.

> 
> But either way around, the range stays valid and page-tables stay
> needed.
> 
> > > Given this is an overlapping mmap -- we're going to need to those
> > > page-tables anyway, so no harm done.
> > 
> > Well, maybe?  The mapping may now be an anon vma vs a file backed, or
> > maybe it's PROT_NONE?
> 
> The page-tables don't care about all that no? The only thing where it
> matters is for things like THP, because that affects the level of
> page-tables, but otherwise it's all page-table content (ptes).

It sounds racy, couldn't we read the old page table entry attributes and
act on it after the new attributes are set?

During the switch, after we drop the page table lock but haven't yet
inserted the new vma, then we'd run into a situation where the new
mapping may already be occupied if we don't have some sort of locking
here.  Wouldn't that be an issue as well?  It seems like there are a
number of things that could go bad?

> 
> > > Only after the VMA is unlinked must we ensure we don't accidentally
> > > re-instantiate page-tables.
> > 
> > It's not as simple as that, unfortunately.  There are vma callbacks for
> > drivers (or hugetlbfs, or whatever) that do other things.  So we need to
> > clean up the area before we are able to replace the vma and part of that
> > clean up is the page tables, or anon vma chain, and/or closing a file.
> > 
> > There are other ways of finding the vma as well, besides the vma tree.
> > We are following the locking so that we are safe from those perspectives
> > as well, and so the vma has to be unlinked in a few places in a certain
> > order.
> 
> For RCU lookups only the mas tree matters -- and its left present there.
> 
> If you really want to block RCU readers, I would suggest punching a hole
> in the mm_mt. All the traditional code won't notice anyway, this is all
> with mmap_lock held for writing.

We don't want to block all rcu readers, we want to block the rcu readers
that would see the problem - that is, anyone trying to read a particular
area.

Right now we can page fault in unpopulated vmas while writing other vmas
to the tree.  We are also moving more users to rcu reading to use the
vmas they need without waiting on writes to finish.

Maybe I don't understand your suggestion, but I would think punching a
hole would lose this advantage?

Thanks,
Liam