Message-ID: <e8192bd7-6c74-4f3e-95d4-38adf56fd4fd@os.amperecomputing.com>
Date: Thu, 13 Mar 2025 10:42:45 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Vasily Gorbik <gor@...nel.org>, Andrew Morton
 <akpm@...ux-foundation.org>, Liam.Howlett@...cle.com, vbabka@...e.cz,
 jannh@...gle.com, oliver.sang@...el.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Vasily Gorbik <gor@...ux.ibm.com>
Subject: Re: [v2 PATCH] mm: vma: skip anonymous vma when inserting vma to file
 rmap tree



On 3/12/25 10:16 PM, Lorenzo Stoakes wrote:
> On Wed, Mar 12, 2025 at 08:04:23PM -0700, Yang Shi wrote:
>>
>> On 3/12/25 4:55 PM, Vasily Gorbik wrote:
>>> On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
>>>> LKP reported 800% performance improvement for small-allocs benchmark
>>>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>>>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>>>> the output of smaps somewhat.
>>> ...
>>>> ---
>>>> v2:
>>>>      * Added the comments in code suggested by Lorenzo
>>>>      * Collected R-b from Lorenzo
>>>>
>>>>    mm/vma.c | 18 ++++++++++++++++--
>>>>    1 file changed, 16 insertions(+), 2 deletions(-)
>>> Hi Yang,
>>>
>>> Replying to v2, as the code is the same as v1 in linux-next:
>>>
>>> The LTP test "mmap10" consistently triggers a kernel NULL pointer
>>> dereference with this change, at least on x86 and s390. Reverting just
>>> this single patch from linux-next fixes the issue.
>> Hi Vasily,
>>
>> Thanks for the report. It is because dup_mmap() decides whether to insert
>> the VMA into the file rmap by checking whether vma->vm_file is NULL. This
>> splat can be killed by skipping anonymous VMAs there too, but that would
>> expose a more severe problem: the struct file refcount may become
>> imbalanced. The refcount is inc'ed in mmap, then inc'ed again by fork(),
>> and dec'ed on unmap or process exit. If we skip the refcount inc in fork,
>> we need to skip the refcount dec in unmap too, but there is still one
>> refcount left from mmap.
>>
>> Can we dec the refcount in mmap once we see it is finally an anonymous vma?
>> Unfortunately, no. If the refcount reaches 0, the struct file will be freed,
>> and we will run into a UAF when looking up smaps IIUC. vma->vm_file may
>> then point to anything.
>>
>> Lorenzo,
>>
>> This problem seems more complicated than I first thought. Making it a
>> real anonymous vma (vm_file is NULL) may still be the best option. But we
>> need to figure out how to keep smaps compatible.
> Ugh lord. I am not in favour of this for reasons aforementioned, and I _really_
> don't want to special case this any more than we already do...

Yeah, understood. I meant we should find a way to keep the smaps output
unchanged or compatible.
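
The refcount problem described in the quoted message above can be
illustrated with a small userspace model. This is only a sketch, not kernel
code: mock_file, mock_vma, the model_* helpers and the skip_anon flag are
hypothetical names, and the starting count of 1 assumes the process still
holds /dev/zero open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct vm_area_struct. */
struct mock_file {
	int refcount;	/* models the struct file reference count */
	bool freed;	/* set once the count drops to zero */
};

struct mock_vma {
	struct mock_file *vm_file;	/* still set for a private /dev/zero mapping */
	bool anonymous;			/* the VMA is treated as anonymous */
};

static void file_get(struct mock_file *f)
{
	f->refcount++;
}

static void file_put(struct mock_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;	/* any remaining vm_file would now dangle */
}

/* mmap: the new VMA always takes one reference on the file. */
static void model_mmap(struct mock_vma *vma, struct mock_file *f, bool anonymous)
{
	vma->vm_file = f;
	vma->anonymous = anonymous;
	file_get(f);
}

/* fork: the child VMA takes a reference unless anonymous VMAs are skipped. */
static void model_fork(struct mock_vma *child, const struct mock_vma *parent,
		       bool skip_anon)
{
	*child = *parent;
	if (!(skip_anon && parent->anonymous))
		file_get(parent->vm_file);
}

/* unmap or exit: drop the VMA's reference unless anonymous VMAs are skipped. */
static void model_unmap(struct mock_vma *vma, bool skip_anon)
{
	if (!(skip_anon && vma->anonymous))
		file_put(vma->vm_file);
}

/* Returns the file's final refcount after mmap, fork and both unmaps. */
int run_scenario(bool skip_anon)
{
	/* Assumption: one reference is held by the open file descriptor. */
	struct mock_file f = { .refcount = 1, .freed = false };
	struct mock_vma parent, child;

	model_mmap(&parent, &f, true);
	model_fork(&child, &parent, skip_anon);
	model_unmap(&child, skip_anon);
	model_unmap(&parent, skip_anon);
	return f.refcount;
}
```

Under these assumptions run_scenario(false) ends at 1 (only the open
descriptor's reference is left), while run_scenario(true) ends at 2: the
reference taken at mmap time is never dropped. Dropping that reference
early instead would let the count reach zero once the descriptor is closed
while vm_file is still set, which is the UAF case for smaps.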

>
> Let me think a bit about this also.
>
> Maybe if you're at LSF we can chat about it there?

Unfortunately I can't make it this year. Have fun!

Thanks,
Yang

>
> Thanks!
>
>> Andrew,
>>
>> Can you please drop this patch from your tree?
>>
>> Thanks,
>> Yang
>>
>>> LTP: starting mmap10
>>> BUG: kernel NULL pointer dereference, address: 0000000000000008
>>> #PF: supervisor read access in kernel mode
>>> #PF: error_code(0x0000) - not-present page
>>> PGD 800000010d22a067 P4D 800000010d22a067 PUD 11ff09067 PMD 0
>>> Oops: Oops: 0000 [#1] PREEMPT SMP PTI
>>> CPU: 5 UID: 0 PID: 1719 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #3
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
>>> RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
>>> Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
>>> RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
>>> RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
>>> RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
>>> RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
>>> R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
>>> R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
>>> FS:  00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0
>>> Call Trace:
>>>    <TASK>
>>>    ? __die_body.cold+0x19/0x2b
>>>    ? page_fault_oops+0xc4/0x1f0
>>>    ? search_extable+0x26/0x30
>>>    ? search_module_extables+0x3f/0x60
>>>    ? exc_page_fault+0x6b/0x150
>>>    ? asm_exc_page_fault+0x26/0x30
>>>    ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
>>>    ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
>>>    ? __rb_insert_augmented+0x2b/0x1d0
>>>    copy_mm+0x48a/0x8c0
>>>    copy_process+0xf98/0x1930
>>>    kernel_clone+0xb7/0x3b0
>>>    __do_sys_clone+0x65/0x90
>>>    do_syscall_64+0x9e/0x1a0
>>>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>> RIP: 0033:0x7ff643eb2b00
>>> Code: 31 c0 31 d2 31 f6 bf 11 00 20 01 48 89 e5 53 48 83 ec 08 64 48 8b 04 25 10 00 00 00 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 89 c3 85 c0 75 31 64 48 8b 04 25 10 00 00
>>> RSP: 002b:00007ffdac219010 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
>>> RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff643eb2b00
>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
>>> RBP: 00007ffdac219020 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 00007ff643df1a10 R11: 0000000000000202 R12: 0000000000000001
>>> R13: 0000000000000000 R14: 00007ff644036000 R15: 0000000000000000
>>>    </TASK>
>>> Modules linked in:
>>> CR2: 0000000000000008
>>> ---[ end trace 0000000000000000 ]---
>>> RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
>>> Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
>>> RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
>>> RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
>>> RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
>>> RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
>>> R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
>>> R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
>>> FS:  00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0
>>>
>>>
>>>
>>> LTP: starting mmap10
>>> Unable to handle kernel pointer dereference in virtual kernel address space
>>> Failing address: 0000000000000000 TEID: 0000000000000483
>>> Fault in home space mode while using kernel ASCE.
>>> AS:000000000247c007 R3:00000001ffffc007 S:00000001ffffb801 P:000000000000013d
>>> Oops: 0004 ilc:3 [#1] SMP
>>> Modules linked in:
>>> CPU: 0 UID: 0 PID: 665 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #16
>>> Hardware name: IBM 3931 A01 704 (KVM/Linux)
>>> Krnl PSW : 0704c00180000000 000003ffe0ee0440 (__rb_insert_augmented+0x60/0x210)
>>>              R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>> Krnl GPRS: 00000000009ff000 0000000000000000 000000008e5f7508 0000000084a7ed08
>>>              00000000000009fe 0000000000000000 0000000000000000 0000037fe06c7b68
>>>              00000000801d0e90 000003ffe04158d0 0000000084a7ed08 0000000000000000
>>>              000003ffbb700000 00000000801d0e48 000003ffe0ee057c 0000037fe06c7a40
>>> Krnl Code: 000003ffe0ee0430: e31030080004        lg      %r1,8(%r3)
>>>              000003ffe0ee0436: ec1200888064        cgrj    %r1,%r2,8,000003ffe0ee0546
>>>             #000003ffe0ee043c: b90400a3            lgr     %r10,%r3
>>>             >000003ffe0ee0440: e310b0100024        stg     %r1,16(%r11)
>>>              000003ffe0ee0446: e3b030080024        stg     %r11,8(%r3)
>>>              000003ffe0ee044c: ec180009007c        cgij    %r1,0,8,000003ffe0ee045e
>>>              000003ffe0ee0452: ec2b000100d9        aghik   %r2,%r11,1
>>>              000003ffe0ee0458: e32010000024        stg     %r2,0(%r1)
>>> Call Trace:
>>>    [<000003ffe0ee0440>] __rb_insert_augmented+0x60/0x210
>>>    [<000003ffe016d6c4>] dup_mmap+0x424/0x8c0
>>>    [<000003ffe016dc62>] copy_mm+0x102/0x1c0
>>>    [<000003ffe016e8ae>] copy_process+0x7ce/0x12b0
>>>    [<000003ffe016f458>] kernel_clone+0x68/0x380
>>>    [<000003ffe016f84a>] __do_sys_clone+0x5a/0x70
>>>    [<000003ffe016faa0>] __s390x_sys_clone+0x40/0x50
>>>    [<000003ffe011c9b6>] do_syscall.constprop.0+0x116/0x140
>>>    [<000003ffe0ef1d64>] __do_syscall+0xd4/0x1c0
>>>    [<000003ffe0efd044>] system_call+0x74/0x98
>>> Last Breaking-Event-Address:
>>>    [<000003ffe0ee058a>] __rb_insert_augmented+0x1aa/0x210
>>> Kernel panic - not syncing: Fatal exception: panic_on_oops

