linux-kernel - Re: [PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached

Message-Id: <20250922204554.55dd890090b0f56ad10a61f5@linux-foundation.org>
Date: Mon, 22 Sep 2025 20:45:54 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: lirongqing <lirongqing@...du.com>
Cc: <corbet@....net>, <lance.yang@...ux.dev>, <mhiramat@...nel.org>,
 <paulmck@...nel.org>, <pawan.kumar.gupta@...ux.intel.com>,
 <mingo@...nel.org>, <dave.hansen@...ux.intel.com>, <rostedt@...dmis.org>,
 <kees@...nel.org>, <arnd@...db.de>, <feng.tang@...ux.alibaba.com>,
 <pauld@...hat.com>, <joel.granados@...nel.org>,
 <linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH][RFC] hung_task: Support to panic when the maximum
 number of hung task warnings is reached

On Tue, 23 Sep 2025 11:37:40 +0800 lirongqing <lirongqing@...du.com> wrote:

> Currently the hung task detector can either panic immediately or continue
> operation when hung tasks are detected. However, there are scenarios
> where we want a more balanced approach:
> 
> - We don't want the system to panic immediately when a few hung tasks
>   are detected, as the system may be able to recover
> - And we also don't want the system to stall indefinitely with multiple
>   hung tasks
> 
> This commit introduces a new mode (value 2) for the hung task panic behavior.
> When set to 2, the system will panic only after the maximum number of hung
> task warnings (hung_task_warnings) has been reached.
> 
> This provides a middle ground between immediate panic and potentially
> infinite stall, allowing for automated vmcore generation after a reasonable

I assume the same argument applies to the NMI watchdog, to the
softlockup detector and to the RCU stall detector?

A general framework to handle all of these might be better.  But why do
it in the kernel at all?  What about a userspace detector which parses
kernel logs (or new procfs counters) and makes such decisions?
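
To make the userspace idea concrete, here is a minimal sketch of such a
detector's decision logic. It assumes the detector's report line looks like
"INFO: task <comm>:<pid> blocked for more than <N> seconds." and that the
policy is simply "escalate once max_warnings reports have been seen". The
names HUNG_RE, should_escalate and max_warnings are hypothetical and are not
part of the patch or of any existing tool.

```python
import re
from typing import Iterable

# Assumed format of a hung task report line, e.g.
# "INFO: task kworker/0:1:123 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def should_escalate(lines: Iterable[str], max_warnings: int) -> bool:
    """Return True once max_warnings hung task reports have been seen.

    `lines` is any iterable of kernel log lines; how they are obtained
    (reading /dev/kmsg, following dmesg output, ...) is left to the caller.
    """
    seen = 0
    for line in lines:
        if HUNG_RE.search(line):
            seen += 1
            if seen >= max_warnings:
                return True
    return False
```

What "escalate" means is a policy choice for the caller, for example
requesting a crash dump through /proc/sysrq-trigger. The same loop could
count softlockup or RCU stall reports by adding further patterns.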