linux-kernel - Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()

lists.openwall.net: Open Source and information security mailing list archives
Message-ID: <794c4d29-8a75-48f2-a1d8-a519d0243290@lucifer.local>
Date: Tue, 11 Feb 2025 10:34:19 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Lai, Yi" <yi1.lai@...ux.intel.com>
Cc: SeongJae Park <sj@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Shakeel Butt <shakeel.butt@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, yi1.lai@...el.com,
        Naresh Kamboju <naresh.kamboju@...aro.org>,
        Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations
 from process_madvise()

+cc Naresh, as [0] does not appear to be in lore, which seems to be having issues
(meaning I can't reply to that thread at all); also +cc Arnd for his reply in [1].

[0]: https://lwn.net/ml/linux-mm/CA+G9fYt5QwJ4_F8fJj7jx9_0Le9kOVSeG38ox9qnKqwsrDdvHQ@mail.gmail.com/
[1]: https://lwn.net/ml/linux-mm/fa1a7a10-f892-4e7e-acb4-0b058aa53d88@app.fastmail.com/

The report here from Yi Lai (thanks for that!) appears to be the same issue.

SJ - I think the issue is that we unconditionally unlock, even for
MADV_HWPOISON and MADV_SOFT_OFFLINE, where madvise_lock() never took the lock
in the first place. E.g.:

static int madvise_lock(struct mm_struct *mm, int behavior)
{

#ifdef CONFIG_MEMORY_FAILURE
	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
		return 0;
#endif
	...
}

But on unlock...

static void madvise_unlock(struct mm_struct *mm, int behavior)
{
	if (madvise_need_mmap_write(behavior))
		mmap_write_unlock(mm);
	else
		mmap_read_unlock(mm);
}

So we just need an equivalent early return in madvise_unlock() as well (ideally
with the check abstracted into a helper shared by both functions).
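A minimal userspace sketch of that symmetry, assuming a mock counter in place of mm->mmap_lock and modelling only the write side (the trace below shows up_write(), so MADV_HWPOISON takes the write path here). The helper madvise_skips_mmap_lock() and every mock_* name are hypothetical, not kernel APIs. MADV_HWPOISON is 100, which matches RDX = 0x64 in the register dump of the quoted report.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the Linux UAPI (asm-generic/mman-common.h). */
#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100
#endif
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

/* Hypothetical mock of mm->mmap_lock (write side only); not a kernel type. */
struct mock_mm {
	int write_held;   /* number of outstanding write locks */
	int bad_unlocks;  /* analogue of lockdep's "bad unlock balance" */
};

/* Hypothetical shared helper: behaviors that never take mmap_lock. */
bool madvise_skips_mmap_lock(int behavior)
{
	return behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE;
}

static void mock_write_unlock(struct mock_mm *mm)
{
	if (mm->write_held == 0) {
		mm->bad_unlocks++;	/* releasing a lock nobody holds */
		return;
	}
	mm->write_held--;
}

/* Mirrors madvise_lock(): early return for memory-failure behaviors. */
int mock_madvise_lock(struct mock_mm *mm, int behavior)
{
	if (madvise_skips_mmap_lock(behavior))
		return 0;
	mm->write_held++;
	return 0;
}

/* Mirrors madvise_unlock() in the patch: no matching early return. */
void mock_madvise_unlock_buggy(struct mock_mm *mm, int behavior)
{
	(void)behavior;
	mock_write_unlock(mm);
}

/* Mirrors the proposed fix: the same early return via the shared helper. */
void mock_madvise_unlock_fixed(struct mock_mm *mm, int behavior)
{
	if (madvise_skips_mmap_lock(behavior))
		return;
	mock_write_unlock(mm);
}
```

Pairing mock_madvise_lock() with mock_madvise_unlock_buggy() for MADV_HWPOISON leaves bad_unlocks at 1, the analogue of the warning in the report; the _fixed variant leaves it at 0 while still releasing the lock for behaviors that took it.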

OK, having just typed that, I realise SJ has already sent a fix. Lore being
broken means I can't link to it. Sigh.

Will reply there.
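The quoted patch below moves the mmap_lock out of the per-range loop in vector_madvise(). A minimal userspace sketch of that change, assuming a mock counter in place of mm->mmap_lock and a hypothetical MOCK_RESTART value standing in for -ERESTARTNOINTR (the fatal-signal/-EINTR path is omitted). All names here are hypothetical illustrations, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for -ERESTARTNOINTR returned by a range operation. */
#define MOCK_RESTART 1

/* Hypothetical mock of mm->mmap_lock that counts acquisitions. */
struct mock_lock {
	int held;          /* 1 while held, 0 otherwise */
	int acquisitions;  /* total number of times the lock was taken */
};

static void mock_acquire(struct mock_lock *l)
{
	assert(!l->held);	/* catches double-locking in this sketch */
	l->held = 1;
	l->acquisitions++;
}

static void mock_release(struct mock_lock *l)
{
	assert(l->held);	/* catches unbalanced unlocks in this sketch */
	l->held = 0;
}

/* Hypothetical per-range work (think madvise_do_behavior()). */
typedef int (*range_op)(size_t idx, void *ctx);

/* Before the patch: do_madvise() locks and unlocks around every range. */
size_t walk_per_range(struct mock_lock *l, size_t nr, range_op op, void *ctx)
{
	size_t i = 0;

	while (i < nr) {
		int ret;

		mock_acquire(l);
		ret = op(i, ctx);
		mock_release(l);
		if (ret == MOCK_RESTART)
			continue;	/* retry the same range */
		if (ret < 0)
			break;
		i++;
	}
	return i;
}

/* After the patch: lock once, walk all ranges, unlock once. */
size_t walk_batched(struct mock_lock *l, size_t nr, range_op op, void *ctx)
{
	size_t i = 0;

	mock_acquire(l);
	while (i < nr) {
		int ret = op(i, ctx);

		if (ret == MOCK_RESTART) {
			/* Drop and reacquire the lock, then retry the same range. */
			mock_release(l);
			mock_acquire(l);
			continue;
		}
		if (ret < 0)
			break;
		i++;
	}
	mock_release(l);
	return i;
}

/* Example op: every range succeeds. */
int op_always_ok(size_t idx, void *ctx)
{
	(void)idx;
	(void)ctx;
	return 0;
}

/* Example op: requests one restart at the index stored in *ctx. */
int op_restart_once(size_t idx, void *ctx)
{
	size_t *restart_at = ctx;

	if (idx == *restart_at) {
		*restart_at = (size_t)-1;	/* only restart once */
		return MOCK_RESTART;
	}
	return 0;
}
```

With eight ranges, walk_per_range() takes the lock eight times and walk_batched() only once; a single restart in the batched walk costs one extra drop-and-reacquire. The batched walk is only balanced because its final release mirrors its initial acquire, which is exactly the invariant madvise_unlock() breaks for MADV_HWPOISON.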


On Tue, Feb 11, 2025 at 01:30:49PM +0800, Lai, Yi wrote:
> On Wed, Feb 05, 2025 at 10:15:17PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining works for
> > all ranges in the loop.
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@...ux.dev>
> > Signed-off-by: SeongJae Park <sj@...nel.org>
> > ---
> >  mm/madvise.c | 26 ++++++++++++++++++++++++--
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 31e5df75b926..5a0a1fc99d27 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >
> >  	total_len = iov_iter_count(iter);
> >
> > +	ret = madvise_lock(mm, behavior);
> > +	if (ret)
> > +		return ret;
> > +
> >  	while (iov_iter_count(iter)) {
> > -		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> > -				 iter_iov_len(iter), behavior);
> > +		unsigned long start = (unsigned long)iter_iov_addr(iter);
> > +		size_t len_in = iter_iov_len(iter);
> > +		size_t len;
> > +
> > +		if (!is_valid_madvise(start, len_in, behavior)) {
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +
> > +		len = PAGE_ALIGN(len_in);
> > +		if (start + len == start)
> > +			ret = 0;
> > +		else
> > +			ret = madvise_do_behavior(mm, start, len_in, len,
> > +					behavior);
> >  		/*
> >  		 * An madvise operation is attempting to restart the syscall,
> >  		 * but we cannot proceed as it would not be correct to repeat
> > @@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  				ret = -EINTR;
> >  				break;
> >  			}
> > +
> > +			/* Drop and reacquire lock to unwind race. */
> > +			madvise_unlock(mm, behavior);
> > +			madvise_lock(mm, behavior);
> >  			continue;
> >  		}
> >  		if (ret < 0)
> >  			break;
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> > +	madvise_unlock(mm, behavior);
> >
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> >
>
> Hi SeongJae Park,
>
> Greetings!
>
> I used Syzkaller and found that there is a WARNING in madvise_unlock in linux-next tag next-20250210.
>
> After bisection, the first bad commit is:
> "
> ec68fbd9e99f mm/madvise: remove redundant mmap_lock operations from process_madvise()
> "
>
> All detailed info can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250210_144836_madvise_unlock/bzImage_df5d6180169ae06a2eac57e33b077ad6f6252440
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250210_144836_madvise_unlock/df5d6180169ae06a2eac57e33b077ad6f6252440_dmesg.log
>
> "
> [  135.191347] Injecting memory failure for pfn 0x8ea0 at process virtual address 0x20e7f000
> [  135.194964] Memory failure: 0x8ea0: recovery action for reserved kernel page: Ignored
> [  135.195584] ------------[ cut here ]------------
> [  135.195863] WARNING: CPU: 1 PID: 680 at ./include/linux/rwsem.h:203 madvise_unlock+0x17e/0x1a0
> [  135.196395] Modules linked in:
> [  135.196612] CPU: 1 UID: 0 PID: 680 Comm: repro Not tainted 6.14.0-rc2-next-20250210-df5d6180169a #1
> [  135.197135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [  135.197818] RIP: 0010:madvise_unlock+0x17e/0x1a0
> [  135.198108] Code: a1 9f ff 31 f6 49 8d bc 24 e0 01 00 00 e8 fa 80 d5 03 31 ff 89 c3 89 c6 e8 9f 9b 9f ff 85 db 0f 85 1c ff ff ffe
> [  135.199154] RSP: 0018:ffff888022647e88 EFLAGS: 00010293
> [  135.199468] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81e878f1
> [  135.199876] RDX: ffff88802264a540 RSI: ffffffff81e878fe RDI: 0000000000000005
> [  135.200285] RBP: ffff888022647ea0 R08: 0000000000000001 R09: ffffed10044c8f25
> [  135.200692] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888021116180
> [  135.201100] R13: ffff8880211162f0 R14: 0000000000004000 R15: ffff888021116180
> [  135.201531] FS:  00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [  135.201996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  135.202329] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
> [  135.202737] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  135.203155] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [  135.203563] PKRU: 55555554
> [  135.203731] Call Trace:
> [  135.203883]  <TASK>
> [  135.204020]  ? show_regs+0x6d/0x80
> [  135.204240]  ? __warn+0xf3/0x390
> [  135.204447]  ? report_bug+0x25e/0x4b0
> [  135.204692]  ? madvise_unlock+0x17e/0x1a0
> [  135.204937]  ? report_bug+0x2cb/0x4b0
> [  135.205167]  ? madvise_unlock+0x17e/0x1a0
> [  135.205413]  ? madvise_unlock+0x17f/0x1a0
> [  135.205706]  ? handle_bug+0xf1/0x190
> [  135.206188]  ? exc_invalid_op+0x3c/0x80
> [  135.206423]  ? asm_exc_invalid_op+0x1f/0x30
> [  135.206678]  ? madvise_unlock+0x171/0x1a0
> [  135.206919]  ? madvise_unlock+0x17e/0x1a0
> [  135.207156]  ? madvise_unlock+0x17e/0x1a0
> [  135.207388]  ? madvise_unlock+0x17e/0x1a0
> [  135.207623]  do_madvise+0x14f/0x1a0
> [  135.207836]  __x64_sys_madvise+0xb2/0x120
> [  135.208067]  ? syscall_trace_enter+0x14f/0x280
> [  135.208328]  x64_sys_call+0x19b1/0x2150
> [  135.208552]  do_syscall_64+0x6d/0x140
> [  135.208766]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  135.209050] RIP: 0033:0x7f032763ee5d
> [  135.209261] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [  135.210261] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [  135.210678] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [  135.211069] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [  135.211458] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [  135.211851] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [  135.212248] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [  135.212656]  </TASK>
> [  135.212794] irq event stamp: 807
> [  135.212987] hardirqs last  enabled at (815): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
> [  135.213477] hardirqs last disabled at (824): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
> [  135.213951] softirqs last  enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [  135.214426] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [  135.214910] ---[ end trace 0000000000000000 ]---
> [  135.215182]
> [  135.215283] =====================================
> [  135.215550] WARNING: bad unlock balance detected!
> [  135.215817] 6.14.0-rc2-next-20250210-df5d6180169a #1 Tainted: G        W
> [  135.216234] -------------------------------------
> [  135.216502] repro/680 is trying to release lock (&mm->mmap_lock) at:
> [  135.216863] [<ffffffff81e87854>] madvise_unlock+0xd4/0x1a0
> [  135.217179] but there are no more locks to release!
> [  135.217457]
> [  135.217457] other info that might help us debug this:
> [  135.217819] no locks held by repro/680.
> [  135.218046]
> [  135.218046] stack backtrace:
> [  135.218296] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G        W          6.14.0-rc2-next-20250210-df5d6180169a #1
> [  135.218308] Tainted: [W]=WARN
> [  135.218310] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [  135.218315] Call Trace:
> [  135.218317]  <TASK>
> [  135.218320]  dump_stack_lvl+0xea/0x150
> [  135.218330]  ? madvise_unlock+0xd4/0x1a0
> [  135.218341]  dump_stack+0x19/0x20
> [  135.218349]  print_unlock_imbalance_bug+0x1b5/0x200
> [  135.218368]  ? madvise_unlock+0xd4/0x1a0
> [  135.218379]  lock_release+0x5bc/0x870
> [  135.218386]  ? madvise_unlock+0x17f/0x1a0
> [  135.218397]  ? handle_bug+0xf1/0x190
> [  135.218407]  ? __pfx_lock_release+0x10/0x10
> [  135.218415]  ? exc_invalid_op+0x3c/0x80
> [  135.218426]  ? asm_exc_invalid_op+0x1f/0x30
> [  135.218441]  up_write+0x31/0x550
> [  135.218449]  ? madvise_unlock+0x17e/0x1a0
> [  135.218462]  madvise_unlock+0xd4/0x1a0
> [  135.218474]  do_madvise+0x14f/0x1a0
> [  135.218487]  __x64_sys_madvise+0xb2/0x120
> [  135.218500]  ? syscall_trace_enter+0x14f/0x280
> [  135.218511]  x64_sys_call+0x19b1/0x2150
> [  135.218522]  do_syscall_64+0x6d/0x140
> [  135.218531]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  135.218542] RIP: 0033:0x7f032763ee5d
> [  135.218548] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [  135.218556] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [  135.218562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [  135.218569] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [  135.218573] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [  135.218578] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [  135.218582] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [  135.218594]  </TASK>
> [  135.228272] ------------[ cut here ]------------
> [  135.228541] DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) && !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)): count = 0x0, magy
> [  135.229541] WARNING: CPU: 1 PID: 680 at kernel/locking/rwsem.c:1367 up_write+0x451/0x550
> [  135.229997] Modules linked in:
> [  135.230182] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G        W          6.14.0-rc2-next-20250210-df5d6180169a #1
> [  135.230758] Tainted: [W]=WARN
> [  135.230940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [  135.231565] RIP: 0010:up_write+0x451/0x550
> [  135.231806] Code: ea 03 80 3c 02 00 0f 85 d5 00 00 00 49 8b 14 24 53 4d 89 f9 4c 89 f1 48 c7 c6 40 bd eb 85 48 c7 c7 60 bc eb 858
> [  135.232814] RSP: 0018:ffff888022647e30 EFLAGS: 00010282
> [  135.233113] RAX: 0000000000000000 RBX: ffffffff85ebbba0 RCX: ffffffff8146cfd3
> [  135.233528] RDX: ffff88802264a540 RSI: ffffffff8146cfe0 RDI: 0000000000000001
> [  135.233929] RBP: ffff888022647e78 R08: 0000000000000001 R09: ffffed10044c8f66
> [  135.234325] R10: 0000000000000001 R11: 57525f4755424544 R12: ffff8880211162f0
> [  135.234722] R13: ffff8880211162f8 R14: ffff8880211162f0 R15: ffff88802264a540
> [  135.235122] FS:  00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [  135.235584] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  135.235912] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
> [  135.236314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  135.236713] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [  135.237114] PKRU: 55555554
> [  135.237276] Call Trace:
> [  135.237445]  <TASK>
> [  135.237580]  ? show_regs+0x6d/0x80
> [  135.237789]  ? __warn+0xf3/0x390
> [  135.237987]  ? find_bug+0x310/0x490
> [  135.238199]  ? up_write+0x451/0x550
> [  135.238412]  ? report_bug+0x2cb/0x4b0
> [  135.238636]  ? up_write+0x451/0x550
> [  135.238853]  ? up_write+0x452/0x550
> [  135.239065]  ? handle_bug+0xf1/0x190
> [  135.239285]  ? exc_invalid_op+0x3c/0x80
> [  135.239515]  ? asm_exc_invalid_op+0x1f/0x30
> [  135.239768]  ? __warn_printk+0x173/0x2e0
> [  135.240001]  ? __warn_printk+0x180/0x2e0
> [  135.240235]  ? up_write+0x451/0x550
> [  135.240445]  ? madvise_unlock+0x17e/0x1a0
> [  135.240686]  madvise_unlock+0xd4/0x1a0
> [  135.240913]  do_madvise+0x14f/0x1a0
> [  135.241126]  __x64_sys_madvise+0xb2/0x120
> [  135.241364]  ? syscall_trace_enter+0x14f/0x280
> [  135.241647]  x64_sys_call+0x19b1/0x2150
> [  135.241887]  do_syscall_64+0x6d/0x140
> [  135.242113]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  135.242407] RIP: 0033:0x7f032763ee5d
> [  135.242619] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [  135.243646] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [  135.244075] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [  135.244476] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [  135.244877] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [  135.245279] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [  135.245696] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [  135.246106]  </TASK>
> [  135.246242] irq event stamp: 857
> [  135.246434] hardirqs last  enabled at (857): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
> [  135.246915] hardirqs last disabled at (856): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
> [  135.247392] softirqs last  enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [  135.247871] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [  135.248347] ---[ end trace 0000000000000000 ]---
> "
>
> Hope this could be insightful to you.
>
> Regards,
> Yi Lai
>
> ---
>
> If you don't need the following environment to reproduce the problem, or if you
> already have a reproduction environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
> Use the command below to log in; there is no password for root.
> ssh -p 10023 root@...alhost
>
> After logging in to the VM (virtual machine) successfully, you can transfer the
> reproducer binary to the VM as shown below and reproduce the problem there:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@...alhost:/root/
>
> Get the bzImage for the target kernel:
> Copy the target kconfig to kernel_src/.config, then run:
> make olddefconfig
> make -jx bzImage           // x should be less than or equal to the number of CPUs on your machine
>
> Put the new bzImage file into start3.sh above to load the target kernel in the VM.
>
>
> Tips:
> If you already have qemu-system-x86_64, please ignore the info below.
> If you want to install qemu v7.1.0:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>
> > --
> > 2.39.5