Message-ID: <36db2f10-ebbe-4ecd-b27f-e02d9e1569c2@paulmck-laptop>
Date: Mon, 22 Sep 2025 23:03:52 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: "Li,Rongqing" <lirongqing@...du.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"corbet@....net" <corbet@....net>,
"lance.yang@...ux.dev" <lance.yang@...ux.dev>,
"mhiramat@...nel.org" <mhiramat@...nel.org>,
"pawan.kumar.gupta@...ux.intel.com" <pawan.kumar.gupta@...ux.intel.com>,
"mingo@...nel.org" <mingo@...nel.org>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"kees@...nel.org" <kees@...nel.org>,
"arnd@...db.de" <arnd@...db.de>,
"feng.tang@...ux.alibaba.com" <feng.tang@...ux.alibaba.com>,
"pauld@...hat.com" <pauld@...hat.com>,
"joel.granados@...nel.org" <joel.granados@...nel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [????] Re: [PATCH][RFC] hung_task: Support to panic when the
maximum number of hung task warnings is reached

On Tue, Sep 23, 2025 at 04:00:03AM +0000, Li,Rongqing wrote:
>
>
> > -----Original Message-----
> > From: Andrew Morton <akpm@...ux-foundation.org>
> > Sent: September 23, 2025 11:46
> > To: Li,Rongqing <lirongqing@...du.com>
> > Cc: corbet@....net; lance.yang@...ux.dev; mhiramat@...nel.org;
> > paulmck@...nel.org; pawan.kumar.gupta@...ux.intel.com; mingo@...nel.org;
> > dave.hansen@...ux.intel.com; rostedt@...dmis.org; kees@...nel.org;
> > arnd@...db.de; feng.tang@...ux.alibaba.com; pauld@...hat.com;
> > joel.granados@...nel.org; linux-doc@...r.kernel.org;
> > linux-kernel@...r.kernel.org
> > Subject: [????] Re: [PATCH][RFC] hung_task: Support to panic when the
> > maximum number of hung task warnings is reached
> >
> > On Tue, 23 Sep 2025 11:37:40 +0800 lirongqing <lirongqing@...du.com> wrote:
> >
> > > Currently the hung task detector can either panic immediately or
> > > continue operation when hung tasks are detected. However, there are
> > > scenarios where we want a more balanced approach:
> > >
> > > - We don't want the system to panic immediately when a few hung tasks
> > > are detected, as the system may be able to recover
> > > - And we also don't want the system to stall indefinitely with multiple
> > > hung tasks
> > >
> > > This commit introduces a new mode (value 2) for the hung task panic behavior.
> > > When set to 2, the system will panic only after the maximum number of
> > > hung task warnings (hung_task_warnings) has been reached.
> > >
> > > This provides a middle ground between immediate panic and potentially
> > > infinite stall, allowing for automated vmcore generation after a
> > > reasonable
> >
> > I assume the same argument applies to the NMI watchdog, to the softlockup
> > detector and to the RCU stall detector?
>
> True, especially the RCU stall detector.

There are the panic_on_rcu_stall and max_rcu_stall_to_panic sysctls, which
together allow you to panic after (say) three RCU CPU stall warnings.
Do those do what you need?

							Thanx, Paul

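For illustration, the two sysctls named above could be set from a small
script. This is only a sketch under stated assumptions: it assumes the
sysctls are exposed as files under /proc/sys/kernel, and the helper name
configure_rcu_stall_panic and its sysctl_root parameter are hypothetical,
introduced so the logic can be tried against a scratch directory.

```python
from pathlib import Path


def configure_rcu_stall_panic(max_stalls: int,
                              sysctl_root: str = "/proc/sys/kernel") -> dict:
    """Request a panic after `max_stalls` RCU CPU stall warnings.

    Hypothetical helper: `sysctl_root` is assumed to be the directory
    that exposes the kernel.* sysctls as files.
    """
    if max_stalls < 1:
        raise ValueError("max_stalls must be at least 1")

    # Write the threshold before enabling the panic, so that a stall
    # arriving between the two writes is judged against the new limit.
    settings = {
        "max_rcu_stall_to_panic": str(max_stalls),
        "panic_on_rcu_stall": "1",
    }
    root = Path(sysctl_root)
    for name, value in settings.items():
        (root / name).write_text(value + "\n")
    return settings
```

Run as root, configure_rcu_stall_panic(3) would correspond to the "(say)
three RCU CPU stall warnings" case. Passing a temporary directory as
sysctl_root exercises the same code without touching the running kernel.
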
> > A general framework to handle all of these might be better. But why do it in
> > kernel at all? What about a userspace detector which parses kernel logs (or
> > new procfs counters) and makes such decisions?
>
>
> I think an in-kernel implementation that leverages existing kernel
> mechanisms is very simple and reliable.
>
> Thanks
>
> -Li
>
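
The userspace alternative raised in the quoted text (a detector that
parses kernel logs and decides when to escalate) might look roughly like
the sketch below. The message pattern is an assumption based on the usual
"INFO: task NAME:PID blocked for more than N seconds." line and may need
adjusting for a given kernel version. The function name should_escalate
and the max_warnings parameter are hypothetical. The sketch only reports
the decision; how to act on it (for example, triggering a crash dump) is
left to the caller.

```python
import re
from typing import Iterable

# Assumed format of a hung task warning in the kernel log.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def should_escalate(log_lines: Iterable[str], max_warnings: int) -> bool:
    """Return True once `max_warnings` hung task warnings have been seen."""
    if max_warnings < 1:
        raise ValueError("max_warnings must be at least 1")

    seen = 0
    for line in log_lines:
        if HUNG_TASK_RE.search(line):
            seen += 1
            if seen >= max_warnings:
                return True
    return False
```

A caller could feed this the output of a kernel log reader line by line.
The threshold plays the role that hung_task_warnings plays in the
proposed in-kernel mode.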