linux-kernel - Re: [PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached

Message-ID: <9067a88d-f5df-4d6e-b3b3-2e266ebcf3d0@linux.dev>
Date: Tue, 23 Sep 2025 11:59:30 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: lirongqing <lirongqing@...du.com>,
 Andrew Morton <akpm@...ux-foundation.org>
Cc: corbet@....net, mhiramat@...nel.org, paulmck@...nel.org,
 pawan.kumar.gupta@...ux.intel.com, mingo@...nel.org,
 dave.hansen@...ux.intel.com, rostedt@...dmis.org, kees@...nel.org,
 arnd@...db.de, feng.tang@...ux.alibaba.com, pauld@...hat.com,
 joel.granados@...nel.org, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH][RFC] hung_task: Support to panic when the maximum number
 of hung task warnings is reached



On 2025/9/23 11:45, Andrew Morton wrote:
> On Tue, 23 Sep 2025 11:37:40 +0800 lirongqing <lirongqing@...du.com> wrote:
> 
>> Currently the hung task detector can either panic immediately or continue
>> operation when hung tasks are detected. However, there are scenarios
>> where we want a more balanced approach:
>>
>> - We don't want the system to panic immediately when a few hung tasks
>>    are detected, as the system may be able to recover
>> - And we also don't want the system to stall indefinitely with multiple
>>    hung tasks
>>
>> This commit introduces a new mode (value 2) for the hung task panic behavior.
>> When set to 2, the system will panic only after the maximum number of hung
>> task warnings (hung_task_warnings) has been reached.
>>
>> This provides a middle ground between immediate panic and potentially
>> infinite stall, allowing for automated vmcore generation after a reasonable
> 
> I assume the same argument applies to the NMI watchdog, to the
> softlockup detector and to the RCU stall detector?
> 
> A general framework to handle all of these might be better.  But why do
> it in kernel at all?  What about a userspace detector which parses
> kernel logs (or new procfs counters) and makes such decisions?

+1. I agree that a userspace detector seems more appropriate for this.

We already have the hung_task_detect_count counter, so a userspace
detector could easily use that to implement custom policies ;)
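
As a rough illustration of the idea above (not something posted in this
thread), a userspace policy could poll the counter and act once a number
of new hung tasks has been seen. This is a minimal sketch under some
assumptions: the counter is readable at
/proc/sys/kernel/hung_task_detect_count (CONFIG_DETECT_HUNG_TASK enabled
on a kernel that provides it), and the limit, the polling interval, and
all function names are hypothetical. Forcing a crash through sysrq 'c'
is only one possible action; it needs root and a configured kdump to
produce a vmcore.

```python
"""Userspace hung-task policy sketch (illustrative, not from the thread)."""
import time
from pathlib import Path

# Assumed location of the counter of hung tasks detected since boot.
COUNT_PATH = Path("/proc/sys/kernel/hung_task_detect_count")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def read_count(path=COUNT_PATH):
    """Return the current counter value as an int."""
    return int(Path(path).read_text().strip())


def limit_reached(baseline, current, limit):
    """True once at least `limit` new hung tasks were seen since `baseline`."""
    return current - baseline >= limit


def watch(limit=10, interval=30.0, path=COUNT_PATH, on_limit=None):
    """Poll the counter; call `on_limit` once and return when the limit is hit."""
    baseline = read_count(path)
    while True:
        time.sleep(interval)
        if limit_reached(baseline, read_count(path), limit):
            if on_limit is not None:
                on_limit()
            return


def crash_for_vmcore():
    """Hypothetical policy action: force a crash via sysrq 'c' (needs root)."""
    SYSRQ_TRIGGER.write_text("c")
```

Calling watch(limit=10, on_limit=crash_for_vmcore) would approximate the
"panic after N hung tasks" behaviour from the patch description, while
keeping the threshold and the reaction entirely in userspace. Any other
callable (logging, alerting) can be passed as on_limit instead.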