linux-kernel - Re: [PATCH v2] vmstat: disable vmstat_work on vmstat_cpu_down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff6461df-25d1-494f-ad34-763faf249309@lucifer.local>
Date: Fri, 3 Jan 2025 23:33:19 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Koichiro Den <koichiro.den@...onical.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] vmstat: disable vmstat_work on vmstat_cpu_down_prep()

On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
> Even after mm/vmstat:online teardown, shepherd may still queue work for
> the dying cpu until the cpu is removed from online mask. While it's
> quite rare, this means that after unbind_workers() unbinds a per-cpu
> kworker, it potentially runs vmstat_update for the dying CPU on an
> irrelevant cpu before entering atomic AP states.
> When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
> backtrace.
>
>   BUG: using smp_processor_id() in preemptible [00000000] code: \
>                                                kworker/7:3/1702
>   caller is refresh_cpu_vm_stats+0x235/0x5f0
>   CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
>   Tainted: [N]=TEST
>   Workqueue: mm_percpu_wq vmstat_update
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x8d/0xb0
>    check_preemption_disabled+0xce/0xe0
>    refresh_cpu_vm_stats+0x235/0x5f0
>    vmstat_update+0x17/0xa0
>    process_one_work+0x869/0x1aa0
>    worker_thread+0x5e5/0x1100
>    kthread+0x29e/0x380
>    ret_from_fork+0x2d/0x70
>    ret_from_fork_asm+0x1a/0x30
>    </TASK>
>
> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
> symmetrically enable it on startup.
>
> Signed-off-by: Koichiro Den <koichiro.den@...onical.com>

Hi,

I observed a warning in my qemu and real hardware, which I bisected to this commit:

[    0.087733] ------------[ cut here ]------------
[    0.087733] workqueue: work disable count underflowed
[    0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0

This is:

static void work_offqd_enable(struct work_offq_data *offqd)
{
	if (likely(offqd->disable > 0))
		offqd->disable--;
	else
		WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
}

So (based on this code) presumably an enable is only required if previously
disabled, and this code is being called on startup unconditionally without
the work having been disabled previously? I'm not hugely familiar with
delayed workqueue implementation details.

[    0.087733] Modules linked in:
[    0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
[    0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    0.087733] RIP: 0010:enable_work+0xb5/0xc0
[    0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
[    0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
[    0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
[    0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
[    0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
[    0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
[    0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
[    0.087733] FS:  0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
[    0.087733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
[    0.087733] PKRU: 55555554
[    0.087733] Call Trace:
[    0.087733]  <TASK>
[    0.087733]  ? enable_work+0xb5/0xc0
[    0.087733]  ? __warn.cold+0x93/0xf2
[    0.087733]  ? enable_work+0xb5/0xc0
[    0.087733]  ? report_bug+0xff/0x140
[    0.087733]  ? handle_bug+0x54/0x90
[    0.087733]  ? exc_invalid_op+0x17/0x70
[    0.087733]  ? asm_exc_invalid_op+0x1a/0x20
[    0.087733]  ? __pfx_vmstat_cpu_online+0x10/0x10
[    0.087733]  ? enable_work+0xb5/0xc0
[    0.087733]  vmstat_cpu_online+0x5c/0x70
[    0.087733]  cpuhp_invoke_callback+0x133/0x440
[    0.087733]  cpuhp_thread_fun+0x95/0x150
[    0.087733]  smpboot_thread_fn+0xd5/0x1d0
[    0.087734]  ? __pfx_smpboot_thread_fn+0x10/0x10
[    0.087735]  kthread+0xc8/0xf0
[    0.087737]  ? __pfx_kthread+0x10/0x10
[    0.087738]  ret_from_fork+0x2c/0x50
[    0.087739]  ? __pfx_kthread+0x10/0x10
[    0.087740]  ret_from_fork_asm+0x1a/0x30
[    0.087742]  </TASK>
[    0.087742] ---[ end trace 0000000000000000 ]---


> ---
> v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
> ---
>  mm/vmstat.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..0889b75cef14 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
>  	}
> +	enable_delayed_work(&per_cpu(vmstat_work, cpu));

Probably needs to be 'if disabled' here, as this is invoked on normal
startup when the work won't have been disabled?

Had a brief look at code and couldn't see how that could be done
however... and one would need to be careful about races... Tricky!

>
>  	return 0;
>  }
>
>  static int vmstat_cpu_down_prep(unsigned int cpu)
>  {
> -	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> +	disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>  	return 0;
>  }
>
> --
> 2.43.0
>
>

Let me know if you need any more details, .config etc.

I noticed this warning on a real box too (in both cases running akpm's
mm-unstable branch), FWIW.