linux-kernel - 答复: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2eeb109f50b34e628327972d1af57f5d@xfusion.com>
Date: Fri, 19 Sep 2025 12:05:38 +0000
From: shechenglong <shechenglong@...sion.com>
To: Catalin Marinas <catalin.marinas@....com>
CC: "will@...nel.org" <will@...nel.org>,
	"linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, xulei <stone.xulei@...sion.com>, chenjialong
	<chenjialong@...sion.com>, yuxiating <yuxiating@...sion.com>
Subject: 答复: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

Okay, understood. Thank you! May I ask when the fix/patch is expected to be available?

-----邮件原件-----
发件人: Catalin Marinas <catalin.marinas@....com> 
发送时间: 2025年9月18日 19:28
收件人: shechenglong <shechenglong@...sion.com>
抄送: will@...nel.org; linux-arm-kernel@...ts.infradead.org; linux-kernel@...r.kernel.org; xulei <stone.xulei@...sion.com>; chenjialong <chenjialong@...sion.com>; yuxiating <yuxiating@...sion.com>
主题: Re: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> Context of the Issue:
> In an ARM64 environment, the following steps were performed:
> 
> 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> 2. Cyclically executed test case pty06 from the LTP test suite.
> 3. Added mitigations=off to the GRUB parameters.
> 
> After 1–2 hours of stress testing, a hardlockup occurred, causing a 
> system crash.
> 
> Root Cause of the Hardlockup:
> Each time stress-ng starts, it invokes the 
> /sys/kernel/debug/clear_warn_once interface, which clears the values 
> in the memory section from __start_once to __end_once. This caused 
> functions like pr_info_once() — originally designed to print only once — to print again every time stress-ng was called.
> If the pty06 test case happened to be using the serial module at that 
> same moment, it would sleep in waiter.list within the __down_common function.
> 
> After pr_info_once() completed its output using the serial module, it 
> invoked the semaphore up() function to wake up the process waiting in 
> waiter.list. This sequence triggered an A-A deadlock, ultimately 
> leading to a hardlockup and system crash.
> 
> To prevent this, a local variable should be used to control and ensure 
> the print operation occurs only once.
> 
> Hard lockup call stack:
> 
> _raw_spin_lock_nested+168
> ttwu_queue+180 （rq_lock(rq, &rf); 2nd acquiring the rq->__lock）
> try_to_wake_up+548
> wake_up_process+32
> __up+88
> up+100
> __up_console_sem+96
> console_unlock+696
> vprintk_emit+428
> vprintk_default+64
> vprintk_func+220
> printk+104
> spectre_v4_enable_task_mitigation+344
> __switch_to+100
> __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> schedule_idle+48
> do_idle+388
> cpu_startup_entry+44
> secondary_start_kernel+352

Is the problem actually that we call the spectre v4 stuff on the
switch_to() path (we can't change this) under the rq_lock() and it subsequently calls printk() which takes the console semaphore? I think the "once" aspect makes it less likely but does not address the actual problem.

> diff --git a/arch/arm64/kernel/proton-pack.c 
> b/arch/arm64/kernel/proton-pack.c index edf1783ffc81..f8663157e041 
> 100644
> --- a/arch/arm64/kernel/proton-pack.c
> +++ b/arch/arm64/kernel/proton-pack.c
> @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
>  	bool ret = cpu_mitigations_off() ||
>  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
>  
> -	if (ret)
> -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> +	static atomic_t __printk_once = ATOMIC_INIT(0);
> +
> +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
>  
>  	return ret;
>  }

I think we should just avoid the printk() on the
spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether from the spectre_v4_mitigations_off() as it's called on kernel entry as well. Just add a different way to print the status during kernel boot if there isn't one already, maybe an initcall.

--
Catalin