linux-kernel - [Question] Call trace occurs occasionally when a rollback is performed upon CPU online timeout

Message-ID: <8ec1a40a-61a6-d8f3-074d-6cc8697f261d@huawei.com>
Date: Wed, 15 Jan 2025 20:32:37 +0800
From: Kunkun Jiang <jiangkunkun@...wei.com>
To: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
	Mark Rutland <mark.rutland@....com>, Jonathan Cameron
	<Jonathan.Cameron@...wei.com>, Gavin Shan <gshan@...hat.com>, James Morse
	<james.morse@....com>, Jean-Philippe Brucker <jean-philippe@...aro.org>,
	Jinjie Ruan <ruanjinjie@...wei.com>, Douglas Anderson
	<dianders@...omium.org>, Puranjay Mohan <puranjay@...nel.org>, Luchunhua
	<luchunhua@...wei.com>
CC: "moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@...ts.infradead.org>,
	open list <linux-kernel@...r.kernel.org>, "wanghaibin.wang@...wei.com"
	<wanghaibin.wang@...wei.com>, Zenghui Yu <yuzenghui@...wei.com>,
	<wangzhou1@...ilicon.com>
Subject: [Question] Call trace occurs occasionally when a rollback is
 performed upon CPU online timeout

Hi all,

I have a question about CPU online/offline. In our test scenario,
various workloads (iperf, fio, sve, ...) run in a VM with 6 vCPUs. At
the same time, two of the vCPUs are repeatedly onlined and offlined
through /sys/devices/system/cpu/cpuX/online. After many hours of
running, call traces appear in the guest.
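
For reference, the online/offline part of the test can be sketched
roughly as below. This is only a minimal illustration under stated
assumptions, not the actual test harness: the helper names
set_cpu_online() and toggle_cpus() are hypothetical, and the CPU numbers
and cycle count are up to the caller.

```c
#include <stdio.h>

/*
 * Write "1" (online) or "0" (offline) to <cpu_root>/cpu<N>/online.
 * cpu_root is normally "/sys/devices/system/cpu".
 * Returns 0 on success, -1 on failure.
 * (Hypothetical helper, for illustration only.)
 */
static int set_cpu_online(const char *cpu_root, int cpu, int online)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "%s/cpu%d/online", cpu_root, cpu);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(online ? "1" : "0", f) >= 0;

    /* stdio buffers the write, so a hotplug error only shows up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/*
 * Offline and then online each listed CPU, `cycles` times.
 * Returns the number of writes that failed.
 */
static int toggle_cpus(const char *cpu_root, const int *cpus, int ncpus,
                       int cycles)
{
    int failures = 0;

    for (int i = 0; i < cycles; i++) {
        for (int j = 0; j < ncpus; j++) {
            if (set_cpu_online(cpu_root, cpus[j], 0) != 0)
                failures++;
            if (set_cpu_online(cpu_root, cpus[j], 1) != 0)
                failures++;
        }
    }
    return failures;
}
```

Run as root in the guest, a call such as
toggle_cpus("/sys/devices/system/cpu", cpus, 2, cycles) with two vCPU
numbers in cpus[] would correspond to the repeated online/offline
described above, while the other workloads run in parallel.
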
First, WARN_ON_ONCE(test_bit(KTHREAD_SHOULD_PARK, &kthread->flags))
is triggered:
> Call trace:
> kthread_park+0xd0/0xdc
> takedown_cpu+0x4c/0x140
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

Second, BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT))
is triggered:
> Call trace:
> irq_work_run_list+0x64/0x70
> smpcfd_dying_cpu+0x24/0x34
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

According to my analysis, the root cause is that the vCPU online
operation times out even though the vCPU has in fact come online
successfully. The timeout triggers a rollback, and during the rollback
secondary_start_kernel is still executing, which results in the call
traces above. So is this a bug? If so, how should it be fixed?

The reason for the timeout has not been found yet. We suspect it is
caused by the heavy load from the running tasks. If you have other
ideas, please point them out.

Thanks,
Kunkun Jiang