linux-kernel - Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful vma_refcount_put() wrt SLAB_TYPESAFE_BY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3a233a85-3a94-422e-87be-591f93acbac7@lucifer.local>
Date: Wed, 23 Jul 2025 19:45:51 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Pedro Falcato <pfalcato@...e.de>,
        Linux-MM <linux-mm@...ck.org>,
        kernel list <linux-kernel@...r.kernel.org>
Subject: Re: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful
 vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU

On Wed, Jul 23, 2025 at 08:30:30PM +0200, Jann Horn wrote:
> On Wed, Jul 23, 2025 at 8:14 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > On Wed, Jul 23, 2025 at 06:26:53PM +0200, Jann Horn wrote:
> > > There's a racy UAF in `vma_refcount_put()` when called on the
> > > `lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
> > > without sufficient protection against concurrent object reuse:
> > >
> > > lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
> > > rcu_read_lock(). At that point, the VMA may be concurrently freed, and
> > > it can be recycled by another process. vma_start_read() then
> > > increments the vma->vm_refcnt (if it is in an acceptable range), and
> > > if this succeeds, vma_start_read() can return a reycled VMA. (As a
> > > sidenote, this goes against what the surrounding comments above
> > > vma_start_read() and in lock_vma_under_rcu() say - it would probably
> > > be cleaner to perform the vma->vm_mm check inside vma_start_read().)
> > >
> > > In this scenario where the VMA has been recycled, lock_vma_under_rcu()
> > > will then detect the mismatching ->vm_mm pointer and drop the VMA
> > > through vma_end_read(), which calls vma_refcount_put().
> >
> > So in _correctly_ identifying the recycling, we then hit a problem. Fun!
> >
> > > vma_refcount_put() does this:
> > >
> > > ```
> > > static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > {
> > >         /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > >         struct mm_struct *mm = vma->vm_mm;
> >
> > Are we at a point where we _should_ be looking at a VMA with vma->vm_mm ==
> > current->mm here?
>
> Well, you _hope_ to be looking at a VMA with vma->vm_mm==current->mm,
> but if you lose a race it is intentional that you can end up with
> another MM's VMA here.
>
> > Or can we not safely assume this?
>
> No.

What code paths lead to vma_refcount_put() with a foreign mm?

I mean maybe it's an unsafe assumption.

I realise we are doing stuff for _reasons_, but I sort of HATE that we have
put ourselves in a position where we might always see a recycled VMA and
rely on a very very complicated seqnum-based locking mechanism to make sure
this doesn't happen.

It feels like we've made ourselves a challenging bed and uncomfy bed...

>
> > Because if we can, can we not check for that here?
> >
> > Do we need to keep the old mm around to wake up waiters if we're happily
> > freeing it anyway?
>
> Well, we don't know if the MM has already been freed, or if it is
> still alive and well and has writers who are stuck waiting for our
> wakeup.

But the mm we're talking about here is some recycled one from another
thread?

Right so, we have:

'mm we meant to get' (which apparently can't be assumed to be current->mm)
'mm we actually got' (which may or may not be freed at any time)

The _meant to get_ one might have eternal waiters. Or might not even need
to be woken up.

I don't see why keeping the 'actually got' one around really helps us? Am I
missing something?

Or is this not the proposal?

>
> > If not, then surely we can just do this check, and not do the wake up if
> > false?
> >
> > I may be mising something or being stupid here so :P
> >
> > >         int oldcnt;
> > >
> > >         rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > >         if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > >
> > >                 if (is_vma_writer_only(oldcnt - 1))
> > >                         rcuwait_wake_up(&mm->vma_writer_wait);
> > >         }
> > > }
> > > ```
> > >
> > > This is wrong: It implicitly assumes that the caller is keeping the
> > > VMA's mm alive, but in this scenario the caller has no relation to the
> > > VMA's mm, so the rcuwait_wake_up() can cause UAF.
> >
> > Oh yikes. Yikes yikes yikes.
> >
> > >
> > > In theory, this could happen to any multithreaded process where thread
> > > A is in the middle of pagefault handling while thread B is
> > > manipulating adjacent VMAs such that VMA merging frees the VMA looked
> > > up by thread A - but in practice, for UAF to actually happen, I think
> > > you need to at least hit three race windows in a row that are each on
> > > the order of a single memory access wide.
> >
> > Hmm ok this is confusing, below you reference threads by number with
> > letter=process, and here you say thread 'A' and "B'.
>
> Ah, yeah, sorry.
>
> > > The interleaving leading to UAF is the following, where threads A1 and
> > > A2 are part of one process and thread B1 is part of another process:
> >
> > Err but you refer to A1, A2, A3 below, should A3=B1?
>
> ... clearly I should have proofread this more carefully. Yes, A3 should be B1.
>
> >
> > > ```
> > > A1               A2               A3
> > > ==               ==               ==
> > > lock_vma_under_rcu
> > >   mas_walk
> > >                  <VMA modification removes the VMA>
> > >                                   mmap
> > >                                     <reallocate the VMA>
> > >   vma_start_read
> > >     READ_ONCE(vma->vm_lock_seq)
> > >     __refcount_inc_not_zero_limited_acquire
> > >                                   munmap
> >                                    ^
> >                                    |
> >                                    OK so here, we have unmapped and
> >                                    returned VMA to the slab so
> >                                    SLAB_TYPESAFE_BY_RCU may now result in
> >                                    you having a VMA with vma->vm_mm !=
> >                                    current->mm right?
>
> Yes. The VMA is unmapped and returned to the slab at the "<VMA
> modification removes the VMA>" in A2 above, and then reallocated when
> B1 does "mmap".
>
> > >                                     __vma_enter_locked
> > >                                       refcount_add_not_zero
> > >   vma_end_read
> > >     vma_refcount_put
> >       ^
> >       |
> >       Then here, we're grabbing process
> >       B's mm... whoops...
>
> Yeah.
>
> > >       __refcount_dec_and_test
> > >                                       rcuwait_wait_event
> > >                                     <finish operation>
> >                                       ^
> >                                       |
> >                                       Then here, the VMA is finally
> >                                       freed right?
>
> Yeah.
>
> > >       rcuwait_wake_up [UAF]
> >         ^
> >         |
> >         And then here we try to deref freed mm->vma_writer_wait
> >         and boom.
>
> Yeah.
>
> > > I'm not sure what the right fix is; I guess one approach would be to
> > > have a special version of vma_refcount_put() for cases where the VMA
> > > has been recycled by another MM that grabs an extra reference to the
> > > MM? But then dropping a reference to the MM afterwards might be a bit
> > > annoying and might require something like mmdrop_async()...
> >
> > This seems a bit convoluted... And I worry about us getting the refcount
> > stuff wrong somehow, we'd have to be very careful about this to avoid
> > leaking VMAs somehow...
>
> FWIW, the slow and boring solution would be to use wake_up_var() and
> wait_var_event(), but those might be too slow for this usecase.

I don't know what those are, but they seem like some general waiter/wake up
mechanism and yeah... the whole point of this VMA lock stuff is perf so
it's probably a no-go.