Message-ID: <CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com>
Date: Wed, 23 Jul 2025 18:26:53 +0200
From: Jann Horn <jannh@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>
Cc: Pedro Falcato <pfalcato@...e.de>, Linux-MM <linux-mm@...ck.org>,
kernel list <linux-kernel@...r.kernel.org>
Subject: [BUG] hard-to-hit mm_struct UAF due to insufficiently careful
vma_refcount_put() wrt SLAB_TYPESAFE_BY_RCU
There's a racy UAF in `vma_refcount_put()` when called on the
`lock_vma_under_rcu()` path because `SLAB_TYPESAFE_BY_RCU` is used
without sufficient protection against concurrent object reuse:
lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
rcu_read_lock(). At that point, the VMA may be concurrently freed, and
it can be recycled by another process. vma_start_read() then
increments the vma->vm_refcnt (if it is in an acceptable range), and
if this succeeds, vma_start_read() can return a recycled VMA. (As a
sidenote, this goes against what the surrounding comments above
vma_start_read() and in lock_vma_under_rcu() say - it would probably
be cleaner to perform the vma->vm_mm check inside vma_start_read().)
In this scenario where the VMA has been recycled, lock_vma_under_rcu()
will then detect the mismatching ->vm_mm pointer and drop the VMA
through vma_end_read(), which calls vma_refcount_put().
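
As a rough illustration of the "increment if in an acceptable range" step, here is a userspace model using C11 atomics. It is only a sketch: the `model_*` names and the `limit` parameter are hypothetical, and this is not the kernel's refcount implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model: take a reference only if the count is nonzero
 * (object not yet freed) and still below `limit`.
 */
static bool model_inc_not_zero_limited(atomic_int *refs, int limit)
{
	int old = atomic_load_explicit(refs, memory_order_relaxed);

	do {
		if (old == 0 || old >= limit)
			return false;
		/* On failure, `old` is reloaded and both checks run again. */
	} while (!atomic_compare_exchange_weak_explicit(refs, &old, old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

/* Helper for experimenting: returns the count after one attempt. */
static int model_count_after_try(int start, int limit)
{
	atomic_int refs;

	atomic_init(&refs, start);
	(void)model_inc_not_zero_limited(&refs, limit);
	return atomic_load(&refs);
}
```

Note that success here only says the count was nonzero and below the limit at the time of the increment. It says nothing about which owner the object currently belongs to, which is why an identity check such as the ->vm_mm comparison has to follow when objects can be recycled.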
vma_refcount_put() does this:
```
static inline void vma_refcount_put(struct vm_area_struct *vma)
{
	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
	struct mm_struct *mm = vma->vm_mm;
	int oldcnt;

	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
		if (is_vma_writer_only(oldcnt - 1))
			rcuwait_wake_up(&mm->vma_writer_wait);
	}
}
```
This is wrong: It implicitly assumes that the caller is keeping the
VMA's mm alive, but in this scenario the caller has no relation to the
VMA's mm, so the rcuwait_wake_up() can cause UAF.
In theory, this could happen to any multithreaded process where thread
A is in the middle of pagefault handling while thread B is
manipulating adjacent VMAs such that VMA merging frees the VMA looked
up by thread A - but in practice, for a UAF to actually happen, I think
you need to hit at least three race windows in a row, each only about
as wide as a single memory access.
The interleaving leading to UAF is the following, where threads A1 and
A2 are part of one process and thread B1 is part of another process:
```
A1                                            A2                                  B1
==                                            ==                                  ==
lock_vma_under_rcu
  mas_walk
                                              <VMA modification removes the VMA>
                                                                                  mmap
                                                                                    <reallocate the VMA>
  vma_start_read
    READ_ONCE(vma->vm_lock_seq)
    __refcount_inc_not_zero_limited_acquire
                                                                                  munmap
                                                                                    __vma_enter_locked
                                                                                      refcount_add_not_zero
  vma_end_read
    vma_refcount_put
      __refcount_dec_and_test
                                                                                      rcuwait_wait_event
                                                                                  <finish operation>
      rcuwait_wake_up [UAF]
```
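
The refcount values along this interleaving can be replayed with a small userspace model, to see why A1's put ends up on the wake-up path. This is a sketch under assumptions: the value of `VMA_LOCK_OFFSET` and the body of the writer-only check are recalled from recent kernel sources and should be verified against the tree in question, and the `model_*` names are hypothetical.

```c
#include <stdbool.h>

/* Assumed value; check include/linux/mm.h in your tree. */
#define VMA_LOCK_OFFSET 0x40000000

/*
 * Model of the writer-only check: the writer bit is set and at most
 * the "attached" reference remains besides it.
 */
static bool model_is_vma_writer_only(int refcnt)
{
	return (refcnt & VMA_LOCK_OFFSET) && refcnt <= VMA_LOCK_OFFSET + 1;
}

/*
 * Replays the refcount steps from the diagram. Returns true if A1's
 * put would go on to call rcuwait_wake_up().
 */
static bool model_a1_put_wakes_writer(void)
{
	int refcnt = 1;			/* B1: recycled VMA is attached */
	int oldcnt;

	refcnt += 1;			/* A1: vma_start_read() takes a ref */
	refcnt += VMA_LOCK_OFFSET;	/* B1: __vma_enter_locked() */

	oldcnt = refcnt;		/* A1: __refcount_dec_and_test() */
	refcnt -= 1;

	/* Count did not hit zero, so the wake-up condition is evaluated. */
	return refcnt != 0 && model_is_vma_writer_only(oldcnt - 1);
}
```

In this model `model_a1_put_wakes_writer()` returns true: after A1's decrement only the writer bit and the attached reference remain, so A1 proceeds to the wake-up using an mm pointer it has no claim on.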
I'm not sure what the right fix is; I guess one approach would be to
have a special version of vma_refcount_put() for cases where the VMA
has been recycled by another MM that grabs an extra reference to the
MM? But then dropping a reference to the MM afterwards might be a bit
annoying and might require something like mmdrop_async()...
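
A single-threaded userspace model of the "grab an extra reference to the MM" idea might look like the following. This is only a sketch of the ordering, not a proposed patch: every `model_*` name is hypothetical, plain integers stand in for atomics, the race window is simulated by a callback, and the question of whether the final drop is allowed in that context (the mmdrop_async() concern above) is ignored entirely.

```c
#include <stdbool.h>

struct model_mm {
	int mm_count;		/* stands in for the mm's reference count */
	bool freed;		/* set when the last reference is dropped */
	bool woke_freed;	/* set if a wake-up touched a freed mm */
};

struct model_vma {
	struct model_mm *vm_mm;
	int vm_refcnt;
};

static void model_mmgrab(struct model_mm *mm)
{
	mm->mm_count++;
}

static void model_mmdrop(struct model_mm *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = true;
}

static void model_wake(struct model_mm *mm)
{
	if (mm->freed)
		mm->woke_freed = true;	/* this is the modelled UAF */
}

/* `race_window` simulates what the other process does between decrement and wake-up. */
static void model_put(struct model_vma *vma, bool pin_mm,
		      void (*race_window)(struct model_mm *))
{
	struct model_mm *mm = vma->vm_mm;

	if (pin_mm)
		model_mmgrab(mm);	/* taken while our VMA ref is still held */
	vma->vm_refcnt--;
	if (race_window)
		race_window(mm);
	model_wake(mm);
	if (pin_mm)
		model_mmdrop(mm);
}

/* The owning process drops its last mm reference inside the window. */
static void model_owner_exits(struct model_mm *mm)
{
	model_mmdrop(mm);
}

/* Returns true if the wake-up touched a freed mm. */
static bool model_run(bool pin_mm)
{
	struct model_mm mm = { .mm_count = 1 };
	struct model_vma vma = { .vm_mm = &mm, .vm_refcnt = 2 };

	model_put(&vma, pin_mm, model_owner_exits);
	return mm.woke_freed;
}
```

Here `model_run(false)` reports the use-after-free and `model_run(true)` does not, because the extra reference keeps the mm alive across the wake-up. The point of the ordering is that the extra reference is taken before the VMA reference is dropped.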
# Reproducer
If you want to actually reproduce this, uh, I have a way to reproduce
it but it's ugly: First apply the KASAN patch
https://lore.kernel.org/all/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com/
, then apply the attached diff vma-lock-delay-inject.diff to inject
delays in four different places and add some logging, then build with:
CONFIG_KASAN=y
CONFIG_PREEMPT=y
CONFIG_SLUB_RCU_DEBUG must be explicitly disabled!
Then run the resulting kernel, move everything off CPU 0 by running
"for pid in $(ls /proc | grep '^[1-9]'); do taskset -p -a 0xe $pid;
done" as root, and then run the attached testcase vmalock-uaf.c.
That should result in output like:
```
[ 105.018129][ T1334] vma_start_read: PRE-INCREMENT DELAY START on
VMA ffff888134b31180
[ 106.026145][ T1335] vm_area_alloc: writer allocated vma ffff888134b31180
[ 107.024146][ T1334] vma_start_read: PRE-INCREMENT DELAY END
[ 107.025836][ T1334] vma_start_read: returning vma
[ 107.026800][ T1334] vma_refcount_put: PRE-DECREMENT DELAY START
[ 107.528751][ T1335] __vma_enter_locked: BEGIN DELAY
[ 110.535863][ T1334] vma_refcount_put: PRE-DECREMENT DELAY END
[ 110.537553][ T1334] vma_refcount_put: PRE-WAKEUP DELAY START
(is_vma_writer_only()=1)
[ 111.529833][ T1335] __vma_enter_locked: END DELAY
[ 121.037571][ T1334] vma_refcount_put: PRE-WAKEUP DELAY END
[ 121.039259][ T1334]
==================================================================
[ 121.040792][ T1334] BUG: KASAN: slab-use-after-free in
rcuwait_wake_up+0x33/0x60
[ 121.042345][ T1334] Read of size 8 at addr ffff8881223545f0 by task TEST/1334
[ 121.043698][ T1334]
[ 121.044175][ T1334] CPU: 0 UID: 1000 PID: 1334 Comm: TEST Not
tainted 6.16.0-rc7-00002-g0df7d6c9705b-dirty #179 PREEMPT
[ 121.044180][ T1334] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 121.044182][ T1334] Call Trace:
[ 121.044193][ T1334] <TASK>
[ 121.044198][ T1334] __dump_stack+0x15/0x20
[ 121.044216][ T1334] dump_stack_lvl+0x6c/0xa0
[ 121.044219][ T1334] print_report+0xbc/0x250
[ 121.044224][ T1334] ? rcuwait_wake_up+0x33/0x60
[ 121.044228][ T1334] kasan_report+0x148/0x180
[ 121.044249][ T1334] ? rcuwait_wake_up+0x33/0x60
[ 121.044251][ T1334] __asan_report_load8_noabort+0x10/0x20
[ 121.044257][ T1334] rcuwait_wake_up+0x33/0x60
[ 121.044259][ T1334] vma_refcount_put+0xbd/0x180
[ 121.044267][ T1334] lock_vma_under_rcu+0x438/0x490
[ 121.044271][ T1334] do_user_addr_fault+0x24c/0xbf0
[ 121.044278][ T1334] exc_page_fault+0x5d/0x90
[ 121.044297][ T1334] asm_exc_page_fault+0x22/0x30
[ 121.044304][ T1334] RIP: 0033:0x556803378682
[ 121.044317][ T1334] Code: ff 89 45 d4 83 7d d4 ff 75 19 48 8d 05 d7
0b 00 00 48 89 c6 bf 01 00 00 00 b8 00 00 00 00 e8 35 fa ff ff 48 8b
05 26 2a 00 00 <0f> b6 00 48 8d 05 d7 0b 00 00 48 89 c6 bf 0f 00 00 00
b8 00 00 00
[ 121.044322][ T1334] RSP: 002b:00007ffcd73e98a0 EFLAGS: 00010213
[ 121.044325][ T1334] RAX: 00007fcc5c28c000 RBX: 00007ffcd73e9aa8
RCX: 00007fcc5c18b23a
[ 121.044327][ T1334] RDX: 0000000000000007 RSI: 000055680337923a
RDI: 000000000000000f
[ 121.044329][ T1334] RBP: 00007ffcd73e9990 R08: 00000000ffffffff
R09: 0000000000000000
[ 121.044330][ T1334] R10: 00007fcc5c1869c2 R11: 0000000000000202
R12: 0000000000000000
[ 121.044332][ T1334] R13: 00007ffcd73e9ac0 R14: 00007fcc5c2cc000
R15: 000055680337add8
[ 121.044335][ T1334] </TASK>
[ 121.044336][ T1334]
[ 121.077313][ T1334] Allocated by task 1334:
[ 121.078117][ T1334] kasan_save_track+0x3a/0x80
[ 121.078982][ T1334] kasan_save_alloc_info+0x38/0x50
[ 121.080012][ T1334] __kasan_slab_alloc+0x47/0x60
[ 121.080920][ T1334] kmem_cache_alloc_noprof+0x19e/0x370
[ 121.081936][ T1334] copy_mm+0xb7/0x400
[ 121.082724][ T1334] copy_process+0xe1f/0x2ac0
[ 121.083585][ T1334] kernel_clone+0x14b/0x540
[ 121.084511][ T1334] __x64_sys_clone+0x11d/0x150
[ 121.085397][ T1334] x64_sys_call+0x2c55/0x2fa0
[ 121.086271][ T1334] do_syscall_64+0x48/0x120
[ 121.087114][ T1334] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 121.088297][ T1334]
[ 121.088730][ T1334] Freed by task 1334:
[ 121.090219][ T1334] kasan_save_track+0x3a/0x80
[ 121.091083][ T1334] kasan_save_free_info+0x42/0x50
[ 121.092122][ T1334] __kasan_slab_free+0x3d/0x60
[ 121.093008][ T1334] kmem_cache_free+0xf5/0x300
[ 121.093878][ T1334] __mmdrop+0x260/0x360
[ 121.094645][ T1334] finish_task_switch+0x29c/0x6d0
[ 121.095645][ T1334] __schedule+0x1396/0x2140
[ 121.096521][ T1334] preempt_schedule_irq+0x67/0xc0
[ 121.097456][ T1334] raw_irqentry_exit_cond_resched+0x30/0x40
[ 121.098560][ T1334] irqentry_exit+0x3f/0x50
[ 121.099386][ T1334] sysvec_apic_timer_interrupt+0x3e/0x80
[ 121.100489][ T1334] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 121.101642][ T1334]
[ 121.102132][ T1334] The buggy address belongs to the object at
ffff888122354500
[ 121.102132][ T1334] which belongs to the cache mm_struct of size 1304
[ 121.106296][ T1334] The buggy address is located 240 bytes inside of
[ 121.106296][ T1334] freed 1304-byte region [ffff888122354500,
ffff888122354a18)
[ 121.109452][ T1334]
[ 121.109964][ T1334] The buggy address belongs to the physical page:
[ 121.111363][ T1334] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888122354ac0 pfn:0x122350
[ 121.113626][ T1334] head: order:3 mapcount:0 entire_mapcount:0
nr_pages_mapped:-1 pincount:0
[ 121.115475][ T1334] memcg:ffff88811b750641
[ 121.116414][ T1334] flags: 0x200000000000240(workingset|head|node=0|zone=2)
[ 121.117949][ T1334] page_type: f5(slab)
[ 121.118815][ T1334] raw: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[ 121.121931][ T1334] raw: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[ 121.123820][ T1334] head: 0200000000000240 ffff888100050640
ffff88810004e9d0 ffff88810004e9d0
[ 121.125722][ T1334] head: ffff888122354ac0 000000000016000d
00000000f5000000 ffff88811b750641
[ 121.127596][ T1334] head: 0200000000000003 ffffea000488d401
ffffea00ffffffff 00000000ffffffff
[ 121.129469][ T1334] head: ffffffffffffffff 0000000000000000
00000000ffffffff 0000000000000008
[ 121.131335][ T1334] page dumped because: kasan: bad access detected
[ 121.132823][ T1334] page_owner tracks the page as allocated
[ 121.134065][ T1334] page last allocated via order 3, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 1165, tgid 1165 (bash), ts 98173533301, free_ts 98103770759
[ 121.139486][ T1334] post_alloc_hook+0x17a/0x190
[ 121.140534][ T1334] get_page_from_freelist+0x2edc/0x2f90
[ 121.141731][ T1334] __alloc_frozen_pages_noprof+0x1c1/0x4d0
[ 121.143022][ T1334] alloc_pages_mpol+0x14e/0x2b0
[ 121.144098][ T1334] alloc_frozen_pages_noprof+0xc4/0xf0
[ 121.145318][ T1334] allocate_slab+0x8f/0x280
[ 121.146296][ T1334] ___slab_alloc+0x3d5/0x8e0
[ 121.147292][ T1334] kmem_cache_alloc_noprof+0x229/0x370
[ 121.148489][ T1334] copy_mm+0xb7/0x400
[ 121.149355][ T1334] copy_process+0xe1f/0x2ac0
[ 121.150352][ T1334] kernel_clone+0x14b/0x540
[ 121.151330][ T1334] __x64_sys_clone+0x11d/0x150
[ 121.152472][ T1334] x64_sys_call+0x2c55/0x2fa0
[ 121.153491][ T1334] do_syscall_64+0x48/0x120
[ 121.154479][ T1334] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 121.156725][ T1334] page last free pid 1321 tgid 1321 stack trace:
[ 121.158098][ T1334] __free_frozen_pages+0xa1c/0xc80
[ 121.159205][ T1334] free_frozen_pages+0xc/0x20
[ 121.160229][ T1334] __free_slab+0xad/0xc0
[ 121.161149][ T1334] free_slab+0x17/0x100
[ 121.162102][ T1334] free_to_partial_list+0x48f/0x5b0
[ 121.163239][ T1334] __slab_free+0x1e5/0x240
[ 121.164210][ T1334] ___cache_free+0xb3/0xf0
[ 121.165174][ T1334] qlist_free_all+0xb7/0x160
[ 121.166176][ T1334] kasan_quarantine_reduce+0x14b/0x170
[ 121.167358][ T1334] __kasan_slab_alloc+0x1e/0x60
[ 121.168456][ T1334] kmem_cache_alloc_noprof+0x19e/0x370
[ 121.169638][ T1334] getname_flags+0x9c/0x490
[ 121.170631][ T1334] do_sys_openat2+0x55/0x100
[ 121.171629][ T1334] __x64_sys_openat+0xf4/0x120
[ 121.173713][ T1334] x64_sys_call+0x1ab/0x2fa0
[ 121.174716][ T1334] do_syscall_64+0x48/0x120
[ 121.175697][ T1334]
[ 121.176216][ T1334] Memory state around the buggy address:
[ 121.177433][ T1334] ffff888122354480: fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc fc
[ 121.179173][ T1334] ffff888122354500: fa fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[ 121.180922][ T1334] >ffff888122354580: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[ 121.182718][ T1334]
^
[ 121.184634][ T1334] ffff888122354600: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[ 121.186398][ T1334] ffff888122354680: fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb fb
[ 121.189204][ T1334]
==================================================================
```
View attachment "vma-lock-delay-inject.diff" of type "text/x-patch" (3621 bytes)
View attachment "vmalock-uaf.c" of type "text/x-csrc" (1876 bytes)