[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <459212b0-6440-48c4-b7ae-47be46f17089@os.amperecomputing.com>
Date: Wed, 12 Mar 2025 20:04:23 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Vasily Gorbik <gor@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>
Cc: Liam.Howlett@...cle.com, lorenzo.stoakes@...cle.com, vbabka@...e.cz,
jannh@...gle.com, oliver.sang@...el.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Vasily Gorbik <gor@...ux.ibm.com>
Subject: Re: [v2 PATCH] mm: vma: skip anonymous vma when inserting vma to file
rmap tree
On 3/12/25 4:55 PM, Vasily Gorbik wrote:
> On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
>> LKP reported 800% performance improvement for small-allocs benchmark
>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>> the output of smaps somewhat.
> ...
>> ---
>> v2:
>> * Added the comments in code suggested by Lorenzo
>> * Collected R-b from Lorenze
>>
>> mm/vma.c | 18 ++++++++++++++++--
>> 1 file changed, 16 insertions(+), 2 deletions(-)
> Hi Yang,
>
> Replying to v2, as the code is the same as v1 in linux-next:
>
> The LTP test "mmap10" consistently triggers a kernel NULL pointer
> dereference with this change, at least on x86 and s390. Reverting just
> this single patch from linux-next fixes the issue.
Hi Vasily,
Thanks for the report. It is because dup_mmap() inserts the VMA into
file rmap by checking whether vma->vm_file is NULL or not. This splat
can be killed by skipping anonymous vma, but this actually will expose a
more severe problem. The struct file refcount may be imbalance. The
refcount is inc'ed in mmap, then inc'ed again by fork(), it is dec'ed
when unmap or process exit. If we skip refcount inc in fork, we need
skip refcount dec in unmap too, but there is still one refcount from mmap.
Can we dec refcount in mmap if we see it is anonymous vma finally?
Unfortunately, no. If the refcount reaches 0, the struct file will be
freed. We will run into UAF when looking up smaps IIUC. It may point to
anything.
Lorenzo,
This problem seems more complicated than what I thought in the first
place. Making it is a real anonymous vma (vm_file is NULL) may be still
the best option. But we need figure out how we can keep compatible smaps.
Andrew,
Can you please drop this patch from your tree?
Thanks,
Yang
>
> LTP: starting mmap10
> BUG: kernel NULL pointer dereference, address: 0000000000000008
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 800000010d22a067 P4D 800000010d22a067 PUD 11ff09067 PMD 0
> Oops: Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 5 UID: 0 PID: 1719 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
> RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
> Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
> RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
> RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
> RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
> RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
> R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
> R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
> FS: 00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0
> Call Trace:
> <TASK>
> ? __die_body.cold+0x19/0x2b
> ? page_fault_oops+0xc4/0x1f0
> ? search_extable+0x26/0x30
> ? search_module_extables+0x3f/0x60
> ? exc_page_fault+0x6b/0x150
> ? asm_exc_page_fault+0x26/0x30
> ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
> ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
> ? __rb_insert_augmented+0x2b/0x1d0
> copy_mm+0x48a/0x8c0
> copy_process+0xf98/0x1930
> kernel_clone+0xb7/0x3b0
> __do_sys_clone+0x65/0x90
> do_syscall_64+0x9e/0x1a0
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7ff643eb2b00
> Code: 31 c0 31 d2 31 f6 bf 11 00 20 01 48 89 e5 53 48 83 ec 08 64 48 8b 04 25 10 00 00 00 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 89 c3 85 c0 75 31 64 48 8b 04 25 10 00 00
> RSP: 002b:00007ffdac219010 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
> RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff643eb2b00
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> RBP: 00007ffdac219020 R08: 0000000000000000 R09: 0000000000000000
> R10: 00007ff643df1a10 R11: 0000000000000202 R12: 0000000000000001
> R13: 0000000000000000 R14: 00007ff644036000 R15: 0000000000000000
> </TASK>
> Modules linked in:
> CR2: 0000000000000008
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
> Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
> RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
> RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
> RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
> RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
> R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
> R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
> FS: 00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0
>
>
>
> LTP: starting mmap10
> Unable to handle kernel pointer dereference in virtual kernel address space
> Failing address: 0000000000000000 TEID: 0000000000000483
> Fault in home space mode while using kernel ASCE.
> AS:000000000247c007 R3:00000001ffffc007 S:00000001ffffb801 P:000000000000013d
> Oops: 0004 ilc:3 [#1] SMP
> Modules linked in:
> CPU: 0 UID: 0 PID: 665 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #16
> Hardware name: IBM 3931 A01 704 (KVM/Linux)
> Krnl PSW : 0704c00180000000 000003ffe0ee0440 (__rb_insert_augmented+0x60/0x210)
> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Krnl GPRS: 00000000009ff000 0000000000000000 000000008e5f7508 0000000084a7ed08
> 00000000000009fe 0000000000000000 0000000000000000 0000037fe06c7b68
> 00000000801d0e90 000003ffe04158d0 0000000084a7ed08 0000000000000000
> 000003ffbb700000 00000000801d0e48 000003ffe0ee057c 0000037fe06c7a40
> Krnl Code: 000003ffe0ee0430: e31030080004 lg %r1,8(%r3)
> 000003ffe0ee0436: ec1200888064 cgrj %r1,%r2,8,000003ffe0ee0546
> #000003ffe0ee043c: b90400a3 lgr %r10,%r3
> >000003ffe0ee0440: e310b0100024 stg %r1,16(%r11)
> 000003ffe0ee0446: e3b030080024 stg %r11,8(%r3)
> 000003ffe0ee044c: ec180009007c cgij %r1,0,8,000003ffe0ee045e
> 000003ffe0ee0452: ec2b000100d9 aghik %r2,%r11,1
> 000003ffe0ee0458: e32010000024 stg %r2,0(%r1)
> Call Trace:
> [<000003ffe0ee0440>] __rb_insert_augmented+0x60/0x210
> [<000003ffe016d6c4>] dup_mmap+0x424/0x8c0
> [<000003ffe016dc62>] copy_mm+0x102/0x1c0
> [<000003ffe016e8ae>] copy_process+0x7ce/0x12b0
> [<000003ffe016f458>] kernel_clone+0x68/0x380
> [<000003ffe016f84a>] __do_sys_clone+0x5a/0x70
> [<000003ffe016faa0>] __s390x_sys_clone+0x40/0x50
> [<000003ffe011c9b6>] do_syscall.constprop.0+0x116/0x140
> [<000003ffe0ef1d64>] __do_syscall+0xd4/0x1c0
> [<000003ffe0efd044>] system_call+0x74/0x98
> Last Breaking-Event-Address:
> [<000003ffe0ee058a>] __rb_insert_augmented+0x1aa/0x210
> Kernel panic - not syncing: Fatal exception: panic_on_oops
Powered by blists - more mailing lists