Message-ID: <8ec1a40a-61a6-d8f3-074d-6cc8697f261d@huawei.com>
Date: Wed, 15 Jan 2025 20:32:37 +0800
From: Kunkun Jiang <jiangkunkun@...wei.com>
To: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
	Mark Rutland <mark.rutland@....com>, Jonathan Cameron
	<Jonathan.Cameron@...wei.com>, Gavin Shan <gshan@...hat.com>, James Morse
	<james.morse@....com>, Jean-Philippe Brucker <jean-philippe@...aro.org>,
	Jinjie Ruan <ruanjinjie@...wei.com>, Douglas Anderson
	<dianders@...omium.org>, Puranjay Mohan <puranjay@...nel.org>, Luchunhua
	<luchunhua@...wei.com>
CC: "moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@...ts.infradead.org>,
	open list <linux-kernel@...r.kernel.org>, "wanghaibin.wang@...wei.com"
	<wanghaibin.wang@...wei.com>, Zenghui Yu <yuzenghui@...wei.com>,
	<wangzhou1@...ilicon.com>
Subject: [Question] Call trace occurs occasionally when a rollback is
 performed upon CPU online timeout

Hi all,

I have a question about CPU online/offline. In the following test 
scenario, various workloads (iperf, fio, sve, ...) run in a VM with 6 
vCPUs. At the same time, two of the vCPUs are repeatedly onlined and 
offlined through /sys/devices/system/cpu/cpuX/online. After running for 
many hours, call traces appear in the guest.
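
For reference, the online/offline loop can be sketched roughly as below. 
This is a minimal illustration, not the actual test harness: the function 
names set_cpu_online() and hotplug_stress() are hypothetical, and the base 
directory is passed in as a parameter so the code can be pointed at 
/sys/devices/system/cpu (as root) or at a scratch directory.

```c
#include <stdio.h>

/*
 * Write '0' or '1' to <base>/cpu<N>/online.
 * Returns 0 on success, -1 on any error.
 * On a real guest, base would be "/sys/devices/system/cpu".
 */
int set_cpu_online(const char *base, int cpu, int online)
{
    char path[256];
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/cpu%d/online", base, cpu);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    if (fputc(online ? '1' : '0', f) == EOF) {
        fclose(f);
        return -1;
    }

    /* The buffered write is flushed here, so a failed store shows up
     * as an fclose() error. */
    return fclose(f) == 0 ? 0 : -1;
}

/*
 * Take each listed CPU offline and bring it back online, `rounds` times.
 * Returns the number of writes that failed.
 */
int hotplug_stress(const char *base, const int *cpus, int ncpus, int rounds)
{
    int failures = 0;

    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < ncpus; i++) {
            if (set_cpu_online(base, cpus[i], 0) != 0)
                failures++;
            if (set_cpu_online(base, cpus[i], 1) != 0)
                failures++;
        }
    }
    return failures;
}
```

For example, with hypothetical vCPU numbers 4 and 5, one would call 
hotplug_stress("/sys/devices/system/cpu", (int[]){4, 5}, 2, rounds) while 
the workloads run in the background.
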
First, WARN_ON_ONCE(test_bit(KTHREAD_SHOULD_PARK, &kthread->flags)) 
in kthread_park() is triggered:
> Call trace:
> kthread_park+0xd0/0xdc
> takedown_cpu+0x4c/0x140
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

Second, BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT)) 
in irq_work_run_list() is triggered:
> Call trace:
> irq_work_run_list+0x64/0x70
> smpcfd_dying_cpu+0x24/0x34
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

According to my analysis, the root cause is that the vCPU online 
operation times out even though the vCPU has in fact come online 
successfully. The timeout triggers a rollback. During the rollback, 
secondary_start_kernel() is still executing on that vCPU, which results 
in the call traces above. Is this a bug? If so, how should it be fixed?
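
To make the suspected sequence concrete, here is a simplified model of 
the three possible outcomes of a bring-up attempt. It is only a sketch 
of my reading of the problem, not kernel code: the enum, the function 
classify_bringup() and the logical timestamps are all hypothetical, and 
any timeout value passed in is illustrative.

```c
/* Possible outcomes of one bring-up attempt in this simplified model. */
enum bringup_result {
    BRINGUP_OK,            /* secondary was online before the deadline */
    BRINGUP_LATE_ROLLBACK, /* deadline passed, rollback started, but the
                            * secondary still came online afterwards */
    BRINGUP_DEAD_ROLLBACK  /* deadline passed, secondary never came online */
};

/*
 * timeout_ms: how long the initiating CPU waits for the secondary.
 * ready_ms:   logical time at which the secondary marks itself online;
 *             a negative value means it never does.
 */
enum bringup_result classify_bringup(long timeout_ms, long ready_ms)
{
    if (ready_ms < 0)
        return BRINGUP_DEAD_ROLLBACK;
    if (ready_ms <= timeout_ms)
        return BRINGUP_OK;

    /* The case suspected here: teardown callbacks run on the initiating
     * CPU while the secondary is still executing its startup path. */
    return BRINGUP_LATE_ROLLBACK;
}
```

In this model, the call traces above correspond to the 
BRINGUP_LATE_ROLLBACK case, for example classify_bringup(5000, 5300).
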

The reason for the timeout has not been found yet. I suspect it is 
caused by excessive load from the running tasks. If you have other 
ideas, please point them out.

Thanks,
Kunkun Jiang
