[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b798c84e-f3fd-486d-ad70-c827385a558a@lucifer.local>
Date: Thu, 24 Jul 2025 06:13:39 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Pedro Falcato <pfalcato@...e.de>,
Linux-MM <linux-mm@...ck.org>,
kernel list <linux-kernel@...r.kernel.org>
Subject: Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful
vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
On Wed, Jul 23, 2025 at 09:43:35PM +0200, Jann Horn wrote:
> Sorry, while typing up this mail I realized I didn't have this stuff
> particularly straight in my head myself when writing my previous mails
> about this...
>
> On Wed, Jul 23, 2025 at 8:45 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> > > On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@...cle.com> wrote:
> > > > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > > > without sufficient protection against concurrent object reuse:
> > > > >
> > > > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > > > it can be recycled by another process. vma_start_read() then
> > > > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > > > if this succeeds, vma_start_read() can return a reycled VMA. (As a
> > > > > sidenote, this goes against what the surrounding comments above
> > > > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > > > >
> > > > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > > > through vma_end_read(), which calls vma_refcount_put().
> > > >
> > > > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> > > >
> > > > > vma_refcount_put() does this:
> > > > >
> > > > > ```
> > > > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > > > {
> > > > > /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > > > struct mm_struct *mm = vma->vm_mm;
> > > >
> > > > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > > > current->mm here?
> > >
> > > Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> > > but if you lose a race it is intentional that you can end up with
> > > another MM's VMA here.
Right I get the SLAB_TYPESAFE_BY_RCU thing, what I'm saying overall is 'can we
detect that we lost the race by knowing what mm this should be'...
>
> (I forgot: The mm passed to lock_vma_under_rcu() is potentially
> different from current->mm if we're coming from uffd_mfill_lock(),
> which would be intentional and desired, but that's not relevant here.
> Sorry for making things more confusing.)
...and because of this, no we can't. I hate how uffd is implemented.
But even so, even if we could, it would SUCK as it would leave us with a
_requirement_ that this had to be current->mm or a race-recycled thing.
Stuff like mshare also wants to pull out assumptions about current->mm, so
we don't really want to be doing this anywhere even if we could.
So moot really.
>
> > > > Or can we not safely assume this?
> > >
> > > No.
> >
> > What code paths lead to vma_refcount_put() with a foreign mm?
>
> Calls to vma_refcount_put() from vma_start_read() or from
> lock_vma_under_rcu() can have an MM different from the mm that was
> passed to lock_vma_under_rcu().
>
> Basically, lock_vma_under_rcu() locklessly looks up a VMA in the maple
> tree of the provided MM; and so immediately after the maple tree
> lookup, before we grab any sort of reference on the VMA, the VMA can
> be freed, and reallocated by another process. If we then essentially
> read-lock this VMA which is used by another MM (by incrementing its
> refcount), waiters in that other MM might start waiting for us; and in
> that case, when we notice we got the wrong VMA and bail out, we have
> to wake those waiters up again after unlocking the VMA by dropping its
> refcount.
Thanks for the explanation but I _think_ I understood this bit, what I
meant was 'except for cases where slab recycled the VMA' :), as in
_intentionally_ vma->vm_mm != current->mm.
You've answered above re: uffd.
>
> > I mean maybe it's an unsafe assumption.
> >
> > I realise we are doing stuff for _reasons_, but I sort of HATE that we have
> > put ourselves in a position where we might always see a recycled VMA and
> > rely on a very very complicated seqnum-based locking mechanism to make sure
> > this doesn't happen.
>
> Yes, that is pretty much the definition of SLAB_TYPESAFE_BY_RCU. ^^
> You get higher data cache hit rates in exchange for complicated "grab
> some kinda stable object reference and then recheck if we got the
> right one" stuff, though it is less annoying when dealing with a
> normal refcount or spinlock or such rather than this kind of
> open-coded sleepable read-write semaphore.
Yes, and again I should have more closely scrutinised the original
series... or at least more closely _understood_ it. Entirely my fault,
though due to workload more than anything (hey guess I need to work some
more weekends :)
I understand we're doing this trade-off for a reason. But we're asking for
problems here.
>
> > It feels like we've made ourselves a challenging bed and uncomfy bed...
> >
> > >
> > > > Because if we can, can we not check for that here?
> > > >
> > > > Do we need to keep the old mm around to wake up waiters if we're happily
> > > > freeing it anyway?
> > >
> > > Well, we don't know if the MM has already been freed, or if it is
> > > still alive and well and has writers who are stuck waiting for our
> > > wakeup.
> >
> > But the mm we're talking about here is some recycled one from another
> > thread?
>
> The MM is not recycled, the VMA is recycled.
Yeah sorry it was late, this is what I meant.
VMA is recycled, so different mm.
>
> > Right so, we have:
> >
> > 'mm we meant to get' (which apparently can't be assumed to be current->mm)
> > 'mm we actually got' (which may or may not be freed at any time)
> >
> > The _meant to get_ one might have eternal waiters. Or might not even need
> > to be woken up.
> >
> > I don't see why keeping the 'actually got' one around really helps us? Am I
> > missing something?
>
> We basically have taken a read lock on a VMA that is part of the
> "actually got" MM, and so we may have caused writers from that MM to
> block and sleep, and since we did that we have to wake them back up
> and say "sorry, locked the wrong object, please continue".
OK I think this is the crux of it then, and what I've been missing here -
we have taken a read lock _by mistake_ in effect on the recycled mm, which
may end up to be a spurious one that we need to immediately drop, but
because of this we might have waiters that could wait forever.
OK I get it. But to safely reference the mm here we need to be assured it
stays around because in case of this not being true, we have nothing to
prevent that mm going away right?
Powered by blists - more mailing lists