Message-ID: <5f6b1d1ee66e4e6197ef22933c942503@baidu.com>
Date: Sat, 11 Oct 2025 14:53:42 +0000
From: "Li,Rongqing" <lirongqing@...du.com>
To: Masami Hiramatsu <mhiramat@...nel.org>
CC: "corbet@....net" <corbet@....net>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "lance.yang@...ux.dev" <lance.yang@...ux.dev>,
"paulmck@...nel.org" <paulmck@...nel.org>,
"pawan.kumar.gupta@...ux.intel.com" <pawan.kumar.gupta@...ux.intel.com>,
"mingo@...nel.org" <mingo@...nel.org>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
"kees@...nel.org" <kees@...nel.org>, "arnd@...db.de" <arnd@...db.de>,
"feng.tang@...ux.alibaba.com" <feng.tang@...ux.alibaba.com>,
"pauld@...hat.com" <pauld@...hat.com>, "joel.granados@...nel.org"
<joel.granados@...nel.org>, "linux-doc@...r.kernel.org"
<linux-doc@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: Re: [PATCH][v2] hung_task: Panic after fixed number of
hung tasks
> -----Original Message-----
> From: Li,Rongqing
> Sent: October 11, 2025 20:03
> To: 'Masami Hiramatsu' <mhiramat@...nel.org>
> Cc: corbet@....net; akpm@...ux-foundation.org; lance.yang@...ux.dev;
> paulmck@...nel.org; pawan.kumar.gupta@...ux.intel.com; mingo@...nel.org;
> dave.hansen@...ux.intel.com; rostedt@...dmis.org; kees@...nel.org;
> arnd@...db.de; feng.tang@...ux.alibaba.com; pauld@...hat.com;
> joel.granados@...nel.org; linux-doc@...r.kernel.org;
> linux-kernel@...r.kernel.org
> Subject: RE: Re: [PATCH][v2] hung_task: Panic after fixed number of
> hung tasks
>
>
>
> > -----Original Message-----
> > From: Masami Hiramatsu <mhiramat@...nel.org>
> > Sent: September 29, 2025 8:48
> > To: Li,Rongqing <lirongqing@...du.com>
> > Cc: corbet@....net; akpm@...ux-foundation.org; lance.yang@...ux.dev;
> > paulmck@...nel.org; pawan.kumar.gupta@...ux.intel.com;
> > mingo@...nel.org; dave.hansen@...ux.intel.com; rostedt@...dmis.org;
> > kees@...nel.org; arnd@...db.de; feng.tang@...ux.alibaba.com;
> > pauld@...hat.com; joel.granados@...nel.org; linux-doc@...r.kernel.org;
> > linux-kernel@...r.kernel.org
> > Subject: Re: [PATCH][v2] hung_task: Panic after fixed number of
> > hung tasks
> >
> > On Sun, 28 Sep 2025 13:31:37 +0800
> > lirongqing <lirongqing@...du.com> wrote:
> >
> > > From: Li RongQing <lirongqing@...du.com>
> > >
> > > > Currently, when hung_task_panic is enabled, the kernel will panic
> > > > immediately upon detecting the first hung task. However, some hung
> > > > tasks are transient and the system can recover fully, while others
> > > > are unrecoverable and trigger consecutive hung task reports, and a
> > > > panic is expected.
> > >
> > > > This commit adds a new sysctl parameter, hung_task_count_to_panic,
> > > > that allows specifying the number of consecutive hung tasks that
> > > > must be detected before triggering a kernel panic. This provides
> > > > finer control for environments where transient hangs may happen but
> > > > persistent hangs should still be fatal.
> >
> > IIUC, perhaps there are multiple groups that require different
> > timeouts for hang checks, and you want to set the hung task timeout to
> > match the shorter one, but ignore the longer ones at that point.
> >
> > > If so, this is essentially a problem with a long process that is
> > > performed under TASK_UNINTERRUPTIBLE. Ideally, the progress of such a
> > > process should be checked periodically, and the hang check should be
> > > reset unless it is really blocked.
> > > But this is not currently implemented. (For example, depending on the
> > > media, it may not be possible to check whether long IO is being
> > > performed.)
> >
> > > The hung_task detector will treat even these kinds of slow operations
> > > as task hangs. But if you set a long detection time to accommodate
> > > them, you will also have to wait that long to detect hangs that occur
> > > within a short period of time.
> >
> > The hung tasks on one major lock can spread in a domino effect.
> > So setting a reasonably short detection time, but not panicking until
> > there are enough of them, seems like a reasonable strategy.
> > > But in this case, I think we also need a "hard timeout limit"
> > > for hung tasks, which will detect the longer ones. Also, you should
> > > use the peak value, not the accumulated value.
> > >
> > > If it is really transient (thus, it is not hung), the accumulation of
> > > such normal but slow operations will still trigger hung_task.
> >
>
>
> Would it be reasonable to trigger a panic only after hung tasks have been
> detected in a certain number of consecutive checks?
>
> Something like:
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index d17cd3f..045bef5 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -304,6 +304,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  	int max_count = sysctl_hung_task_check_count;
>  	unsigned long last_break = jiffies;
>  	struct task_struct *g, *t;
> +	unsigned long pre_detect_count = sysctl_hung_task_detect_count;
> +	static unsigned long contiguous_detect_count;
>
>  	/*
>  	 * If the system crashed already then all bets are off,
> @@ -326,6 +328,15 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>
>  		check_hung_task(t, timeout);
>  	}
> +
> +	if (sysctl_hung_task_detect_count != pre_detect_count) {
> +		contiguous_detect_count++;
> +		if (sysctl_max_hung_task_to_panic &&
> +		    contiguous_detect_count > sysctl_max_hung_task_to_panic)
> +			hung_task_call_panic = 1;
> +	}
> +	else
> +		contiguous_detect_count = 0;
>  unlock:
>  	rcu_read_unlock();
>  	if (hung_task_show_lock)
>
>
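For readers following the thread, the consecutive-detection idea quoted above can be modelled outside the kernel. The following is only a minimal userspace sketch, not kernel code: the struct and function names are hypothetical, and the caller is assumed to pass in the number of hung tasks each scan found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct consecutive_policy {
	unsigned long threshold;	/* 0 disables the policy */
	unsigned long streak;		/* consecutive scans that found a hang */
};

/*
 * Feed in the number of hung tasks found by one scan.
 * Returns true once more than `threshold` scans in a row found a hang,
 * mirroring the `contiguous_detect_count > limit` test in the quoted diff.
 */
static bool consecutive_policy_update(struct consecutive_policy *p,
				      unsigned long found_this_scan)
{
	assert(p != NULL);

	if (found_this_scan == 0) {
		p->streak = 0;	/* a clean scan resets the streak */
		return false;
	}

	p->streak++;
	return p->threshold && p->streak > p->threshold;
}
```

In this model a single clean scan resets the streak, so a transient hang that clears before the next check never accumulates toward a panic.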
A single task hanging for an extended period may not be a critical issue, as users might still log into the system to investigate. However, if multiple tasks hang simultaneously (for example, I/O hangs caused by disk failures), users may be unable to log in at all. That is a serious problem, and a panic is expected. Therefore, the solution should be designed as follows:

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d17cd3f..52ebf18 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -304,6 +304,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	int max_count = sysctl_hung_task_check_count;
 	unsigned long last_break = jiffies;
 	struct task_struct *g, *t;
+	unsigned long pre_detect_count = sysctl_hung_task_detect_count;
 
 	/*
 	 * If the system crashed already then all bets are off,
@@ -326,6 +327,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 
 		check_hung_task(t, timeout);
 	}
+
+	if (sysctl_hung_task_detect_count - pre_detect_count > sysctl_max_hung_task_to_panic) {
+		hung_task_call_panic = 1;
+	}
 unlock:
 	rcu_read_unlock();
 	if (hung_task_show_lock)
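
The per-scan check in the diff above compares how far the cumulative detect counter advanced during one scan against a limit. Below is a minimal userspace sketch of that comparison only; the function name and parameters are hypothetical stand-ins for the kernel variables, and nothing here is taken from the actual kernel tree.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the per-scan check. `before` and `after`
 * stand in for the cumulative detect counter sampled at the start and end
 * of one scan; `limit` stands in for the proposed sysctl value.
 */
static bool scan_exceeds_limit(unsigned long before, unsigned long after,
			       unsigned long limit)
{
	assert(after >= before);	/* the cumulative counter only grows */
	return after - before > limit;
}
```

Note that with `limit` set to 0, this model trips as soon as one hung task is found in a scan. Whether the final patch would instead treat 0 as "disabled" is not stated in this message.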
-Li
> > -Li
>
> > Thank you,
> >
> > >
> > > Acked-by: Lance Yang <lance.yang@...ux.dev>
> > > Signed-off-by: Li RongQing <lirongqing@...du.com>
> > > ---
> > > Diff with v1: change documentation as Lance suggested
> > >
> > > Documentation/admin-guide/sysctl/kernel.rst | 8 ++++++++
> > > kernel/hung_task.c | 14 +++++++++++++-
> > > 2 files changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst
> > > b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 8b49eab..98b47a7 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -405,6 +405,14 @@ This file shows up if
> > ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > > 1 Panic immediately.
> > > = =================================================
> > >
> > > > +hung_task_count_to_panic
> > > > +========================
> > > +
> > > +When set to a non-zero value, a kernel panic will be triggered if
> > > +the number of detected hung tasks reaches this value.
> > > +
> > > +Note that setting hung_task_panic=1 will still cause an immediate
> > > +panic on the first hung task.
> >
> > > What happens if it is 0?
> >
> > >
> > > hung_task_check_count
> > > =====================
> > > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > > index 8708a12..87a6421 100644
> > > > --- a/kernel/hung_task.c
> > > > +++ b/kernel/hung_task.c
> > > > @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > > > static unsigned int __read_mostly sysctl_hung_task_panic =
> > > > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > >
> > > > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > > > +
> > > > static int
> > > > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > > > {
> > > > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > >
> > > trace_sched_process_hang(t);
> > >
> > > - if (sysctl_hung_task_panic) {
> > > + if (sysctl_hung_task_panic ||
> > > + (sysctl_hung_task_count_to_panic &&
> > > > + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> > > console_verbose();
> > > hung_task_show_lock = true;
> > > hung_task_call_panic = true;
> > > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] =
> {
> > > .extra2 = SYSCTL_ONE,
> > > },
> > > {
> > > + .procname = "hung_task_count_to_panic",
> > > + .data = &sysctl_hung_task_count_to_panic,
> > > + .maxlen = sizeof(int),
> > > + .mode = 0644,
> > > + .proc_handler = proc_dointvec_minmax,
> > > + .extra1 = SYSCTL_ZERO,
> > > + },
> > > + {
> > > .procname = "hung_task_check_count",
> > > .data = &sysctl_hung_task_check_count,
> > > .maxlen = sizeof(int),
> > > --
> > > 2.9.4
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@...nel.org>