Message-ID: <CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com>
Date: Wed, 23 Jul 2025 18:26:53 +0200
From: Jann Horn <jannh@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>, 
	Vlastimil Babka <vbabka@...e.cz>
Cc: Pedro Falcato <pfalcato@...e.de>, Linux-MM <linux-mm@...ck.org>, 
	kernel list <linux-kernel@...r.kernel.org>
Subject: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful
 vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU

There's a racy UAF in `vma_refcount_put()` when called on the
`lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
without sufficient protection against concurrent object reuse:

lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
rcu_read_lock(). At that point, the VMA may be concurrently freed, and
it can be recycled by another process. vma_start_read() then
increments the vma->vm_refcnt (if it is in an acceptable range), and
if this succeeds, vma_start_read() can return a recycled VMA. (As a
side note, this goes against what the surrounding comments above
vma_start_read() and in lock_vma_under_rcu() say - it would probably
be cleaner to perform the vma->vm_mm check inside vma_start_read().)
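
To make the recycling hazard concrete, here is a minimal userspace
sketch of the lookup-then-validate pattern. It is only a model:
struct vma_model, struct mm_model and the vma_model_*() helpers are
made-up names, there is no real RCU or concurrency in it, and the
kernel's actual refcount rules are more involved. The point it tries
to show is that a successful refcount increment only proves the slot
is live, not that it still belongs to the mm that was looked up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and vm_area_struct. */
struct mm_model { int id; };
struct vma_model {
        int refcnt;           /* 0 means detached/free in this model */
        struct mm_model *mm;  /* current owner; changes when the slot is recycled */
};

/* Take a reference unless the object is currently detached. */
static bool vma_model_tryget(struct vma_model *vma)
{
        if (vma->refcnt == 0)
                return false;
        vma->refcnt++;
        return true;
}

/* Simulate free + reallocation of the same slot by another mm. */
static void vma_model_recycle(struct vma_model *vma, struct mm_model *new_mm)
{
        vma->mm = new_mm;
        vma->refcnt = 1;      /* attached to its new owner */
}

/*
 * Lookup-then-validate: the reference only proves the slot is live,
 * so the owner has to be re-checked after the reference is taken.
 */
static struct vma_model *vma_model_lock(struct vma_model *candidate,
                                        struct mm_model *expected)
{
        if (!vma_model_tryget(candidate))
                return NULL;
        if (candidate->mm != expected) {
                /* We briefly held a reference on a foreign object. */
                assert(candidate->refcnt > 0);
                candidate->refcnt--;
                return NULL;
        }
        return candidate;
}
```

The mismatch branch is the interesting one: for a moment the reader
holds, and then drops, a reference on an object owned by an unrelated
mm.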

In this scenario where the VMA has been recycled, lock_vma_under_rcu()
will then detect the mismatching ->vm_mm pointer and drop the VMA
through vma_end_read(), which calls vma_refcount_put().
vma_refcount_put() does this:

```
static inline void vma_refcount_put(struct vm_area_struct *vma)
{
        /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
        struct mm_struct *mm = vma->vm_mm;
        int oldcnt;

        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
        if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {

                if (is_vma_writer_only(oldcnt - 1))
                        rcuwait_wake_up(&mm->vma_writer_wait);
        }
}
```

This is wrong: It implicitly assumes that the caller is keeping the
VMA's mm alive, but in this scenario the caller has no relation to the
VMA's mm, so the rcuwait_wake_up() can cause UAF.

In theory, this could happen to any multithreaded process where thread
A is in the middle of pagefault handling while thread B is
manipulating adjacent VMAs such that VMA merging frees the VMA looked
up by thread A - but in practice, for UAF to actually happen, I think
you need to at least hit three race windows in a row that are each on
the order of a single memory access wide.

The interleaving leading to UAF is the following, where threads A1 and
A2 are part of one process and thread B1 is part of another process:
```
A1               A2               B1
==               ==               ==
lock_vma_under_rcu
  mas_walk
                 <VMA modification removes the VMA>
                                  mmap
                                    <reallocate the VMA>
  vma_start_read
    READ_ONCE(vma->vm_lock_seq)
    __refcount_inc_not_zero_limited_acquire
                                  munmap
                                    __vma_enter_locked
                                      refcount_add_not_zero
  vma_end_read
    vma_refcount_put
      __refcount_dec_and_test
                                      rcuwait_wait_event
                                    <finish operation>
      rcuwait_wake_up [UAF]
```
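
The ordering above can be replayed deterministically in a small
userspace model. This is a sketch under loud assumptions: WRITER_BIT
is a made-up stand-in for the writer's refcount bias, the `freed` flag
replaces an actual free so that the stale access can be detected
without real undefined behaviour, and the steps run sequentially
instead of racing. A2's VMA removal is left out since it only sets the
stage for the recycling.

```c
#include <assert.h>
#include <stdbool.h>

#define WRITER_BIT 0x100  /* hypothetical stand-in for the writer's refcount bias */

struct mm_model {
        bool freed;       /* set once the owning process has torn down its mm */
        int wakeups;      /* wake-ups delivered while the mm was still alive */
};

struct vma_model {
        int refcnt;
        struct mm_model *mm;
};

/* Model of the wake-up: returns true if it touched a freed mm. */
static bool wake_writer(struct mm_model *mm)
{
        if (mm->freed)
                return true;  /* in the kernel this would be the UAF */
        mm->wakeups++;
        return false;
}

/* Replays the A1/B1 steps in order; true means the wake-up hit a freed mm. */
static bool replay_interleaving(void)
{
        struct mm_model mm_b = { .freed = false, .wakeups = 0 };
        /* B1: mmap() has recycled the VMA slot for its own mm. */
        struct vma_model vma = { .refcnt = 1, .mm = &mm_b };

        /* A1: vma_start_read() takes a reference on the recycled VMA. */
        vma.refcnt++;

        /* B1: munmap() -> __vma_enter_locked() adds the writer bias. */
        vma.refcnt += WRITER_BIT;

        /* A1: vma_refcount_put() snapshots mm, then drops its reference. */
        struct mm_model *mm = vma.mm;
        int newcnt = --vma.refcnt;
        bool writer_only = (newcnt == WRITER_BIT + 1);

        /* B1: sees no readers left, finishes, and its mm goes away. */
        vma.refcnt -= WRITER_BIT;
        mm_b.freed = true;

        /* A1: wakes the writer through the stale mm pointer. */
        return writer_only && wake_writer(mm);
}
```

In this model replay_interleaving() reports the stale access because
nothing ties the lifetime of `mm` to A1 once A1's VMA reference is
gone.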

I'm not sure what the right fix is; I guess one approach would be to
have a special version of vma_refcount_put() for cases where the VMA
has been recycled by another MM that grabs an extra reference to the
MM? But then dropping a reference to the MM afterwards might be a bit
annoying and might require something like mmdrop_async()...
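
As a rough illustration of that idea (not a proposed patch), the
following userspace sketch pins the foreign mm before the VMA
reference is dropped. Everything in it is hypothetical: the `pins`
counter only loosely resembles an mm reference count, and
mm_model_pin()/mm_model_unpin() are invented names. It assumes the mm
is still alive at the point where the reader holds its VMA reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct with a pin count. */
struct mm_model {
        int pins;
        bool freed;
};

static void mm_model_pin(struct mm_model *mm)
{
        assert(!mm->freed && mm->pins > 0);
        mm->pins++;
}

/* Drop a pin; the last one frees the mm. Returns true if this call freed it. */
static bool mm_model_unpin(struct mm_model *mm)
{
        assert(mm->pins > 0);
        if (--mm->pins == 0) {
                mm->freed = true;
                return true;
        }
        return false;
}

/* Returns true if the reader's wake-up would touch a freed mm. */
static bool replay_with_pin(bool *reader_freed_mm)
{
        struct mm_model mm_b = { .pins = 1, .freed = false };  /* owner B1's pin */

        /* A1: still holds the VMA reference, so pin the mm first. */
        mm_model_pin(&mm_b);

        /* A1 drops the VMA reference; B1 then finishes and goes away. */
        mm_model_unpin(&mm_b);

        /* A1: the wake-up now runs against a pinned, still-live mm. */
        bool stale = mm_b.freed;

        /* A1: dropping the extra pin may turn out to be the final put. */
        *reader_freed_mm = mm_model_unpin(&mm_b);
        return stale;
}
```

In this model the wake-up no longer touches a freed mm, but the final
put lands on the reader's side, which is the part that might be
awkward in the pagefault path and is where something like
mmdrop_async() would come in.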


# Reproducer
If you want to actually reproduce this, uh, I have a way to reproduce
it but it's ugly: First apply the KASAN patch
https://lore.kernel.org/all/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com/
, then apply the attached diff vma-lock-delay-inject.diff to inject
delays in four different places and add some logging, then build with:

CONFIG_KASAN=y
CONFIG_PREEMPT=y
CONFIG_SLUB_RCU_DEBUG must be explicitly disabled!

Then run the resulting kernel, move everything off CPU 0 by running
"for pid in $(ls /proc | grep '^[1-9]'); do taskset -p -a 0xe $pid;
done" as root, and then run the attached testcase vmalock-uaf.c.

That should result in output like:
```
[  105.018129][ T1334] vma_start_read: PRE-INCREMENT DELAY START on
VMA ffff888134b31180
[  106.026145][ T1335] vm_area_alloc: writer allocated vma ffff888134b31180
[  107.024146][ T1334] vma_start_read: PRE-INCREMENT DELAY END
[  107.025836][ T1334] vma_start_read: returning vma
[  107.026800][ T1334] vma_refcount_put: PRE-DECREMENT DELAY START
[  107.528751][ T1335] __vma_enter_locked: BEGIN DELAY
[  110.535863][ T1334] vma_refcount_put: PRE-DECREMENT DELAY END
[  110.537553][ T1334] vma_refcount_put: PRE-WAKEUP DELAY START
(is_vma_writer_only()=1)
[  111.529833][ T1335] __vma_enter_locked: END DELAY
[  121.037571][ T1334] vma_refcount_put: PRE-WAKEUP DELAY END
[  121.039259][ T1334]
==================================================================
[  121.040792][ T1334] BUG: KASAN: slab-use-after-free in
rcuwait_wake_up+0x33/0x60
[  121.042345][ T1334] Read of size 8 at addr ffff8881223545f0 by task TEST/1334
[  121.043698][ T1334]
[  121.044175][ T1334] CPU: 0 UID: 1000 PID: 1334 Comm: TEST Not
tainted 6.16.0-rc7-00002-g0df7d6c9705b-dirty #179 PREEMPT
[  121.044180][ T1334] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  121.044182][ T1334] Call Trace:
[  121.044193][ T1334]  <TASK>
[  121.044198][ T1334]  __dump_stack+0x15/0x20
[  121.044216][ T1334]  dump_stack_lvl+0x6c/0xa0
[  121.044219][ T1334]  print_report+0xbc/0x250
[  121.044224][ T1334]  ? rcuwait_wake_up+0x33/0x60
[  121.044228][ T1334]  kasan_report+0x148/0x180
[  121.044249][ T1334]  ? rcuwait_wake_up+0x33/0x60
[  121.044251][ T1334]  __asan_report_load8_noabort+0x10/0x20
[  121.044257][ T1334]  rcuwait_wake_up+0x33/0x60
[  121.044259][ T1334]  vma_refcount_put+0xbd/0x180
[  121.044267][ T1334]  lock_vma_under_rcu+0x438/0x490
[  121.044271][ T1334]  do_user_addr_fault+0x24c/0xbf0
[  121.044278][ T1334]  exc_page_fault+0x5d/0x90
[  121.044297][ T1334]  asm_exc_page_fault+0x22/0x30
[  121.044304][ T1334] RIP: 0033:0x556803378682
[  121.044317][ T1334] Code: ff 89 45 d4 83 7d d4 ff 75 19 48 8d 05 d7
0b 00 00 48 89 c6 bf 01 00 00 00 b8 00 00 00 00 e8 35 fa ff ff 48 8b
05 26 2a 00 00 <0f> b6 00 48 8d 05 d7 0b 00 00 48 89 c6 bf 0f 00 00 00
b8 00 00 00
[  121.044322][ T1334] RSP: 002b:00007ffcd73e98a0 EFLAGS: 00010213
[  121.044325][ T1334] RAX: 00007fcc5c28c000 RBX: 00007ffcd73e9aa8
RCX: 00007fcc5c18b23a
[  121.044327][ T1334] RDX: 0000000000000007 RSI: 000055680337923a
RDI: 000000000000000f
[  121.044329][ T1334] RBP: 00007ffcd73e9990 R08: 00000000ffffffff
R09: 0000000000000000
[  121.044330][ T1334] R10: 00007fcc5c1869c2 R11: 0000000000000202
R12: 0000000000000000
[  121.044332][ T1334] R13: 00007ffcd73e9ac0 R14: 00007fcc5c2cc000
R15: 000055680337add8
[  121.044335][ T1334]  </TASK>
[  121.044336][ T1334]
[  121.077313][ T1334] Allocated by task 1334:
[  121.078117][ T1334]  kasan_save_track+0x3a/0x80
[  121.078982][ T1334]  kasan_save_alloc_info+0x38/0x50
[  121.080012][ T1334]  __kasan_slab_alloc+0x47/0x60
[  121.080920][ T1334]  kmem_cache_alloc_noprof+0x19e/0x370
[  121.081936][ T1334]  copy_mm+0xb7/0x400
[  121.082724][ T1334]  copy_process+0xe1f/0x2ac0
[  121.083585][ T1334]  kernel_clone+0x14b/0x540
[  121.084511][ T1334]  __x64_sys_clone+0x11d/0x150
[  121.085397][ T1334]  x64_sys_call+0x2c55/0x2fa0
[  121.086271][ T1334]  do_syscall_64+0x48/0x120
[  121.087114][ T1334]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  121.088297][ T1334]
[  121.088730][ T1334] Freed by task 1334:
[  121.090219][ T1334]  kasan_save_track+0x3a/0x80
[  121.091083][ T1334]  kasan_save_free_info+0x42/0x50
[  121.092122][ T1334]  __kasan_slab_free+0x3d/0x60
[  121.093008][ T1334]  kmem_cache_free+0xf5/0x300
[  121.093878][ T1334]  __mmdrop+0x260/0x360
[  121.094645][ T1334]  finish_task_switch+0x29c/0x6d0
[  121.095645][ T1334]  __schedule+0x1396/0x2140
[  121.096521][ T1334]  preempt_schedule_irq+0x67/0xc0
[  121.097456][ T1334]  raw_irqentry_exit_cond_resched+0x30/0x40
[  121.098560][ T1334]  irqentry_exit+0x3f/0x50
[  121.099386][ T1334]  sysvec_apic_timer_interrupt+0x3e/0x80
[  121.100489][ T1334]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[  121.101642][ T1334]
[  121.102132][ T1334] The buggy address belongs to the object at
ffff888122354500
[  121.102132][ T1334]  which belongs to the cache mm_struct of size 1304
[  121.106296][ T1334] The buggy address is located 240 bytes inside of
[  121.106296][ T1334]  freed 1304-byte region [ffff888122354500,
ffff888122354a18)
[  121.109452][ T1334]
[  121.109964][ T1334] The buggy address belongs to the physical page:
[  121.111363][ T1334] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888122354ac0 pfn:0x122350
[  121.113626][ T1334] head: order:3 mapcount:0 entire_mapcount:0
nr_pages_mapped:-1 pincount:0
[  121.115475][ T1334] memcg:ffff88811b750641
[  121.116414][ T1334] flags: 0x200000000000240(workingset|head|node=0|zone=2)
[  121.117949][ T1334] page_type: f5(slab)
[  121.118815][ T1334] raw: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[  121.121931][ T1334] raw: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[  121.123820][ T1334] head: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[  121.125722][ T1334] head: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[  121.127596][ T1334] head: 0200000000000003 ffffea000488d401
ffffea00ffffffff 00000000ffffffff
[  121.129469][ T1334] head: ffffffffffffffff 0000000000000000
00000000ffffffff 0000000000000008
[  121.131335][ T1334] page dumped because: kasan: bad access detected
[  121.132823][ T1334] page_owner tracks the page as allocated
[  121.134065][ T1334] page last allocated via order 3, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 1165, tgid 1165 (bash), ts 98173533301, free_ts 98103770759
[  121.139486][ T1334]  post_alloc_hook+0x17a/0x190
[  121.140534][ T1334]  get_page_from_freelist+0x2edc/0x2f90
[  121.141731][ T1334]  __alloc_frozen_pages_noprof+0x1c1/0x4d0
[  121.143022][ T1334]  alloc_pages_mpol+0x14e/0x2b0
[  121.144098][ T1334]  alloc_frozen_pages_noprof+0xc4/0xf0
[  121.145318][ T1334]  allocate_slab+0x8f/0x280
[  121.146296][ T1334]  ___slab_alloc+0x3d5/0x8e0
[  121.147292][ T1334]  kmem_cache_alloc_noprof+0x229/0x370
[  121.148489][ T1334]  copy_mm+0xb7/0x400
[  121.149355][ T1334]  copy_process+0xe1f/0x2ac0
[  121.150352][ T1334]  kernel_clone+0x14b/0x540
[  121.151330][ T1334]  __x64_sys_clone+0x11d/0x150
[  121.152472][ T1334]  x64_sys_call+0x2c55/0x2fa0
[  121.153491][ T1334]  do_syscall_64+0x48/0x120
[  121.154479][ T1334]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  121.156725][ T1334] page last free pid 1321 tgid 1321 stack trace:
[  121.158098][ T1334]  __free_frozen_pages+0xa1c/0xc80
[  121.159205][ T1334]  free_frozen_pages+0xc/0x20
[  121.160229][ T1334]  __free_slab+0xad/0xc0
[  121.161149][ T1334]  free_slab+0x17/0x100
[  121.162102][ T1334]  free_to_partial_list+0x48f/0x5b0
[  121.163239][ T1334]  __slab_free+0x1e5/0x240
[  121.164210][ T1334]  ___cache_free+0xb3/0xf0
[  121.165174][ T1334]  qlist_free_all+0xb7/0x160
[  121.166176][ T1334]  kasan_quarantine_reduce+0x14b/0x170
[  121.167358][ T1334]  __kasan_slab_alloc+0x1e/0x60
[  121.168456][ T1334]  kmem_cache_alloc_noprof+0x19e/0x370
[  121.169638][ T1334]  getname_flags+0x9c/0x490
[  121.170631][ T1334]  do_sys_openat2+0x55/0x100
[  121.171629][ T1334]  __x64_sys_openat+0xf4/0x120
[  121.173713][ T1334]  x64_sys_call+0x1ab/0x2fa0
[  121.174716][ T1334]  do_syscall_64+0x48/0x120
[  121.175697][ T1334]
[  121.176216][ T1334] Memory state around the buggy address:
[  121.177433][ T1334]  ffff888122354480: fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc fc
[  121.179173][ T1334]  ffff888122354500: fa fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.180922][ T1334] >ffff888122354580: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.182718][ T1334]
             ^
[  121.184634][ T1334]  ffff888122354600: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.186398][ T1334]  ffff888122354680: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[  121.189204][ T1334]
==================================================================
```

View attachment "vma-lock-delay-inject.diff" of type "text/x-patch" (3621 bytes)

View attachment "vmalock-uaf.c" of type "text/x-csrc" (1876 bytes)