linux-kernel - Re: [PATCH] mm/mmap_lock: Reset maple state on lock_vma_under

Message-ID: <1885ac9d-1a5e-45a2-90d7-f4ecb5848937@lucifer.local>
Date: Fri, 14 Nov 2025 11:51:11 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeel.butt@...ux.dev>, Jann Horn <jannh@...gle.com>,
        stable@...r.kernel.org,
        syzbot+131f9eb2b5807573275c@...kaller.appspotmail.com,
        "Paul E . McKenney" <paulmck@...nel.org>
Subject: Re: [PATCH] mm/mmap_lock: Reset maple state on lock_vma_under_rcu()
 retry

On Thu, Nov 13, 2025 at 12:28:58PM -0500, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [251113 05:45]:
> > On Thu, Nov 13, 2025 at 12:04:19AM +0000, Matthew Wilcox wrote:
> > > On Wed, Nov 12, 2025 at 03:06:38PM +0000, Lorenzo Stoakes wrote:
> > > > > Any time the rcu read lock is dropped, the maple state must be
> > > > > invalidated.  Resetting the address and state to MA_START is the safest
> > > > > course of action, which will result in the next operation starting from
> > > > > the top of the tree.
> > > >
> > > > Since we all missed it I do wonder if we need some super clear comment
> > > > saying 'hey if you drop + re-acquire RCU lock you MUST revalidate mas state
> > > > by doing 'blah'.
> > >
> > > I mean, this really isn't an RCU thing.  This is also bad:
> > >
> > > 	spin_lock(a);
> > > 	p = *q;
> > > 	spin_unlock(a);
> > > 	spin_lock(a);
> > > 	b = *p;
> > >
> > > p could have been freed while you didn't hold lock a.  Detecting this
> > > kind of thing needs compiler assistance (ie Rust) to let you know that
> > > you don't have the right to do that any more.
> >
> > Right but in your example the use of the pointers is _really clear_. In the
> > mas situation, the pointers are embedded in the helper struct, there's a
> > state machine, etc. so it's harder to catch this.
>
> We could modify the above example to use a helper struct and the same
> problem would arise...

I disagree.

It's a helper struct with a state machine, manipulated by API functions. In fact
it turns out we _can_ recover this state even after dropping/reacquiring the
lock by calling the appropriate API functions to do so.

You manipulate this state via mas_xxx() commands, and in fact we resolve the
lock issue by issuing the correct one.

So it's a problem of abstraction I think.
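
To make the 'state machine behind an API' point concrete, here is a minimal
userspace sketch. It is a toy model and not the maple tree: toy_tree,
toy_state, toy_set() and toy_walk() are hypothetical names invented for
illustration, with toy_set() only loosely playing the role mas_set() plays in
the patch. The tree being swapped out stands in for a modification that
happens while the lock is not held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model for illustration; NOT the maple tree API. */
enum toy_status { TOY_START, TOY_ACTIVE };

struct toy_node {
	unsigned long start, end;	/* inclusive range */
	int value;
};

struct toy_tree {
	struct toy_node *nodes;		/* sorted, non-overlapping ranges */
	size_t nr;
};

struct toy_state {
	struct toy_tree *tree;
	unsigned long index;		/* address being looked up */
	enum toy_status status;
	struct toy_node *cached;	/* only meaningful while the lock is held */
};

/* Forget any cached position so the next walk starts from the top. */
static void toy_set(struct toy_state *ts, unsigned long index)
{
	ts->index = index;
	ts->status = TOY_START;
	ts->cached = NULL;
}

/*
 * Walk to the node containing ts->index. An active state trusts its cached
 * pointer, which is exactly what goes wrong if the tree changed in between.
 */
static struct toy_node *toy_walk(struct toy_state *ts)
{
	assert(ts && ts->tree);

	if (ts->status == TOY_ACTIVE && ts->cached)
		return ts->cached;

	for (size_t i = 0; i < ts->tree->nr; i++) {
		struct toy_node *n = &ts->tree->nodes[i];

		if (ts->index >= n->start && ts->index <= n->end) {
			ts->cached = n;
			ts->status = TOY_ACTIVE;
			return n;
		}
	}
	return NULL;
}
```

In this model, if the tree's node array is replaced between two toy_walk()
calls, the second call still hands back the old node. Calling toy_set() with
the same address first makes the walk start over and find the current node.
The bug is invisible at the call site because the stale pointer lives inside
the helper struct.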

HOWEVER, clearly the crux of the problem as you say elsewhere is that we are
using the 'advanced' API and handling our own lock, which leaves us open to
mistakes like this.

My thought process here is around 'can we avoid a bunch of mm developers all
making the same mistake again'.

In this case I mean - it's a unique situation, in some already _very_ hairy VMA
lock code, that used to be much simpler (*grumble grumble*). We're paying the
price for rolling our own mechanism here in general.

But I think more broadly, perhaps there are things we can do here to help. You
need to be able to go on vacation without having to worry about what mistakes we
might make with this stuff :P

>
> >
> > There's already a state machine embedded in it, and I think the confusing
> > bit, at least for me, was a line of thinking like - 'oh there's all this
> > logic that figures out what's going on and if there's an error rewalks and
> > etc. - so it'll handle this case too'.
> >
> > Obviously, very much wrong.
> >
> > Generally I wonder if, when dealing with VMAs, we shouldn't just use the
> > VMA iterator anyway? Whenever I see 'naked' mas stuff I'm always a little
> > confused as to why.
>
> I am not sure why this was left as maple state either.  But translating
> it to the vma iterator would result in the same bug.  The locking story
> would be the same.  There isn't much to the vma iterator, it will just
> call the mas_ functions for you.

Yes I understand it wouldn't fix the bug :) I'm saying this as an aside, and it
leads into the suggestion I make below.

>
> In other code, the maple state is used when we need to do special
> operations that would be the single user of a vma iterator function.  I
> suspect this was the case here at some point.

Right yes. And perhaps so.

>
> >
> >
> > >
> > > > I think one source of confusion for me with maple tree operations is - what
> > > > to do if we are in a position where some kind of reset is needed?
> > > >
> > > > So even if I'd realised 'aha we need to reset this' it wouldn't be obvious
> > > > to me that we ought to set to the address.
> > >
> > > I think that's a separate problem.
> >
> > Sure but I think there's a broader issue around confusion arising around
> > mas state and when we need to do one thing or another, there were a number
> > of issues that arose in the past where people got confused about what to do
> > with vma iterator state.
> >
> > I think it's a difficult problem - we're both trying to abstract stuff
> > here but also retain performance, which is a trade-off.
> >
> > >
> > > > > +++ b/mm/mmap_lock.c
> > > > > @@ -257,6 +257,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > > >  		if (PTR_ERR(vma) == -EAGAIN) {
> > > > >  			count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > >  			/* The area was replaced with another one */
> > > > > +			mas_set(&mas, address);
> > > >
> > > > I wonder if we could detect that the RCU lock was released (+ reacquired) in
> > > > mas_walk() in a debug mode, like CONFIG_VM_DEBUG_MAPLE_TREE?
> > >
> > > Dropping and reacquiring the RCU read lock should have been a big red
> > > flag.  I didn't have time to review the patches, but if I had, I would
> >
> > I think if you have 3 mm developers who all work with VMAs all the time
> > missing this, that's a signal that something is confusing here :)
> >
> > So the issue is we all thought dropping the RCU lock would be OK, and
> > mas_walk(...) would 'somehow' do the right thing. See above for why I think
> > perhaps that happened.
>
> But again, I feel like we could replace the maple state with any helper
> struct and this could also be missed.

I disagree for the reasons stated above.

>
> I'm not sure there's an easy way to remove this class of errors without
> changing the basic tooling to be rust or the like...

Well I like to be optimistic that we can find ways forward without that.
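
As one possible C-only direction, here is a rough userspace sketch of the
debug-mode idea quoted earlier (detecting that the lock was dropped and
reacquired before a walk). Everything in it is hypothetical: toy_lock_gen,
toy_state and toy_walk_checked() are made-up names, and a single global
generation counter is a big simplification, since real RCU read-side sections
nest and are tracked per task.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model for illustration; NOT a real kernel facility. */
static unsigned long toy_lock_gen;	/* bumped on every read-lock acquisition */

struct toy_state {
	unsigned long index;
	unsigned long gen;	/* lock generation the state was last walked under */
	bool active;		/* true once the state holds a cached position */
};

static void toy_read_lock(void)
{
	toy_lock_gen++;
}

static void toy_read_unlock(void)
{
	/* Nothing to do in this model; only acquisitions are counted. */
}

/* Drop any cached position so the state is safe under a new lock section. */
static void toy_set(struct toy_state *ts, unsigned long index)
{
	ts->index = index;
	ts->active = false;
}

/*
 * Debug-style walk: refuse to use a cached position that was established
 * under an earlier lock section. A debug kernel might WARN at this point.
 */
static bool toy_walk_checked(struct toy_state *ts)
{
	if (ts->active && ts->gen != toy_lock_gen)
		return false;

	ts->active = true;
	ts->gen = toy_lock_gen;
	return true;
}
```

Here a walk after unlock + lock fails the check until toy_set() is called,
which is the sort of loud failure that would have flagged the missing reset.
Whether something like this is cheap or precise enough for a real debug
option is a separate question.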

>
> vma_start_read() is inherently complicated because of what it does
> without taking the mmap lock.  Dealing with a potential failure/retry is
> equally messy.

Yes I agree.

>
> The locking is impossible to do in a clean way since one caller does not
> take the rcu read lock itself, but may return without it held in many
> scenarios.

Yes absolutely. I am not necessarily in love with how complicated we've made all
of this and I am not sure it was justified, but unfortunately I didn't pay
enough attention to the VMA lock seqcount rework.

>
> >
> > > have suggested passing the mas down to the routine that drops the rcu
> > > read lock so it can be invalidated before dropping the readlock.
> > >
> >
> > This would require changing vma_start_read(), which is called by both
> > lock_vma_under_rcu() and lock_next_vma().
> >
> > We could make them consistent and have lock_vma_under_rcu() do something
> > like:
> >
> > 	VMA_ITERATOR(vmi, mm, address);
> >
> > 	...
> >
> > 	rcu_read_lock();
> > 	vma = vma_start_read(&vmi);
> >
> > And have vma_start_read() handle the:
> >
> > 	if (!vma) {
> > 		rcu_read_unlock();
> > 		goto inval;
> > 	}
> >
> > Case we have in lock_vma_under_rcu() now.
> >
> > We'd need to keep:
> >
> > 	vma = vma_next(vmi);
> > 	if (!vma)
> > 		return NULL;
> >
> > In lock_next_vma().
> >
> > Then you could have:
> >
> > err:
> > 	/* Reset so state is valid if reused. */
> > 	vma_iter_reset(vmi);
> > 	rcu_read_unlock();
> >
> > In vma_start_read().
> >
> > Assuming any/all of this is correct :)
> >
> > I _think_ based on what Liam said in other sub-thread the reset should work
> > here (perhaps not quite maximally efficient).
>
> No, don't do that.  If you want to go this route, use vma_iter_set() in
> the error label to set the address.  Which means that we'll need to pass
> the vma iterator and the address into vma_start_read() from both callers.

Well that's what I'm proposing we do re: passing in the vma iterator, so it
seems we're generally aligned on this, but sure we should use vma_iter_set(),
ack on that.

>
> And may as well add this in vma_start_read()..
>
> err_unstable:
>  	vma_iter_set(&vmi, address);

Ack.

>
> >
> > If we're willing to rely on the optimiser, or accept that there is no real
> > perf impact, we could do both by also having the 'set address' step happen
> > in lock_vma_under_rcu(), e.g.:
> >
> >
> > 	VMA_ITERATOR(vmi, mm, address);
> >
> > 	...
> >
> > retry:
> > 	rcu_read_lock();
> > 	vma_iter_set(&vmi, address);
> > 	vma = vma_start_read(&vmi);
>
> lock_next_vma() also calls vma_iter_set() in the -EAGAIN case, so
> passing both through might make more sense.

Yes.

>
> >
> > Let me know if any of this is sane... :)
>
> The locking on this function makes it virtually impossible to reuse for
> anything beyond the two users it has today.  Passing the iterator down
> might remind people of what to do if the function itself changes.  It
> does seem like the right way of handling this, since we can't clean up
> the locking.

OK, I can put forward a patch for this if that'd be helpful!
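
For reference, the shape being discussed (pass the iterator and the address
down, and have the callee reset the iterator on the unstable path before it
drops the lock) can be modelled in userspace roughly as below. This is a
sketch under stated assumptions, not the proposed patch: toy_iter,
toy_start_read() and toy_lock_under_read() are hypothetical stand-ins, the
'lock' is a plain counter, and fail_budget just forces a number of -EAGAIN
retries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model for illustration; NOT the kernel implementation. */
struct toy_iter {
	unsigned long index;
	bool positioned;	/* true while the iterator caches a position */
};

static int toy_lock_depth;	/* stand-in for the read lock */

static void toy_read_lock(void)   { toy_lock_depth++; }
static void toy_read_unlock(void) { toy_lock_depth--; }

/* Drop the cached position and point the iterator back at addr. */
static void toy_iter_set(struct toy_iter *it, unsigned long addr)
{
	it->index = addr;
	it->positioned = false;
}

/*
 * Returns 0 with the lock still held, or -EAGAIN with the lock dropped.
 * On the -EAGAIN path the iterator is reset BEFORE the lock is released,
 * so the caller can never see a stale cached position.
 */
static int toy_start_read(struct toy_iter *it, unsigned long address,
			  int *fail_budget)
{
	it->positioned = true;		/* pretend a walk happened */

	if (*fail_budget > 0) {
		(*fail_budget)--;
		/* err_unstable: */
		toy_iter_set(it, address);
		toy_read_unlock();
		return -EAGAIN;
	}
	return 0;
}

/* Caller side: retry until stable, return the number of retries taken. */
static int toy_lock_under_read(struct toy_iter *it, unsigned long address,
			       int fail_budget)
{
	int retries = 0;
	int ret;

	toy_iter_set(it, address);
	do {
		toy_read_lock();
		/* Invariant the callee maintains across unlock/relock. */
		assert(!it->positioned);
		ret = toy_start_read(it, address, &fail_budget);
		if (ret == -EAGAIN)
			retries++;
	} while (ret == -EAGAIN);

	toy_read_unlock();
	return retries;
}
```

The point of the design is that the function which drops the lock is also the
one that resets the iterator, so a caller cannot forget to do it on retry.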

>
> Thanks,
> Liam
>

Cheers, Lorenzo