Message-ID: <81514e1d-4a10-4466-8a87-2d4b0927195b@paulmck-laptop>
Date: Fri, 26 Sep 2025 11:02:03 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Lance Yang <lance.yang@...ux.dev>
Cc: lirongqing <lirongqing@...du.com>, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, arnd@...db.de,
feng.tang@...ux.alibaba.com, joel.granados@...nel.org,
kees@...nel.org, rostedt@...dmis.org, pauld@...hat.com,
pawan.kumar.gupta@...ux.intel.com, mhiramat@...nel.org,
dave.hansen@...ux.intel.com, corbet@....net,
akpm@...ux-foundation.org, mingo@...nel.org
Subject: Re: [PATCH] hung_task: Panic after fixed number of hung tasks
On Thu, Sep 25, 2025 at 06:26:00PM +0800, Lance Yang wrote:
>
> Thanks for the patch!
>
> On 2025/9/25 14:06, lirongqing wrote:
> > From: Li RongQing <lirongqing@...du.com>
> >
> > Currently, when hung_task_panic is enabled, the kernel panics immediately
> > upon detecting the first hung task. However, some hung tasks are transient
> > and the system can recover fully, while others are unrecoverable and
> > trigger consecutive hung task reports, in which case a panic is expected.
>
> The new hung_task_count_to_panic relies on an absolute count, but I
> assume the real indicator you're trying to capture is the trend or
> rate of increase over a time window (e.g., "panic if count increases
> by 5 in 10 minutes").
>
> IMHO, this kind of time-windowed, trend-based logic seems much more
> flexible and better suited for a userspace monitoring agent :)
>
> In other words, why is this the right place for this feature?

A possibly related question is "why are RCU CPU stall warnings implemented
in the kernel instead of in userspace?" One reason is that by the
time that things get bad enough to trigger an RCU CPU stall warning,
userspace might not be capable of doing much of anything. Thus, there
is an uncomfortably high probability that orchestrating RCU CPU stall
warnings from userspace would cause these warnings to be lost entirely.

Similar reasoning might (or might not) apply to the hung-task mechanism.

							Thanx, Paul

> Please sell it to us ;)
> Lance
>
> >
> > This commit adds a new sysctl parameter, hung_task_count_to_panic, that
> > allows specifying the number of consecutive hung tasks that must be detected
> > before triggering a kernel panic. This provides finer control for
> > environments where transient hangs may happen but persistent hangs should
> > still be fatal.
> >
> > Signed-off-by: Li RongQing <lirongqing@...du.com>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> > kernel/hung_task.c | 14 +++++++++++++-
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > index 8b49eab..4240e7b 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > 1 Panic immediately.
> > = =================================================
> > +hung_task_count_to_panic
> > +========================
> > +
> > +When set to a non-zero value, the kernel will trigger a panic after this
> > +number of consecutive hung tasks has occurred.
> > +
> > hung_task_check_count
> > =====================
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index 8708a12..87a6421 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > static unsigned int __read_mostly sysctl_hung_task_panic =
> > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > +
> > static int
> > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > {
> > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > trace_sched_process_hang(t);
> > - if (sysctl_hung_task_panic) {
> > + if (sysctl_hung_task_panic ||
> > + (sysctl_hung_task_count_to_panic &&
> > + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> > console_verbose();
> > hung_task_show_lock = true;
> > hung_task_call_panic = true;
> > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > .extra2 = SYSCTL_ONE,
> > },
> > {
> > + .procname = "hung_task_count_to_panic",
> > + .data = &sysctl_hung_task_count_to_panic,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + },
> > + {
> > .procname = "hung_task_check_count",
> > .data = &sysctl_hung_task_check_count,
> > .maxlen = sizeof(int),
>
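For readers comparing the two approaches discussed in this thread, here is a minimal userspace sketch in C, not kernel code. should_panic_absolute() mirrors the condition the quoted patch adds to check_hung_task(). The hung_window helpers are purely hypothetical and illustrate the "count increases by N within a time window" idea raised in the review. All names, the fixed sample capacity, and the use of seconds as the time unit are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Mirrors the condition in the quoted patch: panic if hung_task_panic is
 * set, or if a non-zero threshold has been reached by the cumulative
 * number of detected hung tasks.
 */
static bool should_panic_absolute(unsigned int panic_enabled,
                                  unsigned long detect_count,
                                  unsigned int count_to_panic)
{
    return panic_enabled ||
           (count_to_panic && detect_count >= count_to_panic);
}

/* Hypothetical: fixed-size ring of (timestamp, cumulative count) samples. */
#define WINDOW_SAMPLES 16

struct hung_window {
    unsigned long time[WINDOW_SAMPLES];   /* sample time, in seconds */
    unsigned long count[WINDOW_SAMPLES];  /* cumulative count at that time */
    size_t head;                          /* next slot to overwrite */
    size_t used;                          /* number of valid samples */
};

/* Record one sample; timestamps and counts are assumed to be monotonic. */
static void window_record(struct hung_window *w, unsigned long now,
                          unsigned long count)
{
    w->time[w->head] = now;
    w->count[w->head] = count;
    w->head = (w->head + 1) % WINDOW_SAMPLES;
    if (w->used < WINDOW_SAMPLES)
        w->used++;
}

/*
 * Return true if the count grew by at least max_increase between some
 * retained sample no older than window_secs and the newest sample.
 */
static bool window_exceeded(const struct hung_window *w, unsigned long now,
                            unsigned long window_secs,
                            unsigned long max_increase)
{
    assert(max_increase > 0);
    if (w->used == 0)
        return false;

    size_t newest = (w->head + WINDOW_SAMPLES - 1) % WINDOW_SAMPLES;
    unsigned long latest = w->count[newest];

    /* Walk from the newest sample back toward the oldest. */
    for (size_t i = 0; i < w->used; i++) {
        size_t idx = (w->head + WINDOW_SAMPLES - 1 - i) % WINDOW_SAMPLES;

        if (now - w->time[idx] > window_secs)
            break;  /* this sample and all earlier ones are too old */
        if (latest - w->count[idx] >= max_increase)
            return true;
    }
    return false;
}
```

The absolute check needs no history, which is what makes it cheap to evaluate inside the kernel. The windowed variant has to keep samples and a notion of time, and, as the reply above points out, a userspace agent running it may itself be unable to make progress once tasks are hanging.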