linux-kernel - RE: [????] Re: [PATCH][v2] hung_task: Panic after fixed number of hung tasks

Message-ID: <5f6b1d1ee66e4e6197ef22933c942503@baidu.com>
Date: Sat, 11 Oct 2025 14:53:42 +0000
From: "Li,Rongqing" <lirongqing@...du.com>
To: Masami Hiramatsu <mhiramat@...nel.org>
CC: "corbet@....net" <corbet@....net>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "lance.yang@...ux.dev" <lance.yang@...ux.dev>,
	"paulmck@...nel.org" <paulmck@...nel.org>,
	"pawan.kumar.gupta@...ux.intel.com" <pawan.kumar.gupta@...ux.intel.com>,
	"mingo@...nel.org" <mingo@...nel.org>, "dave.hansen@...ux.intel.com"
	<dave.hansen@...ux.intel.com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
	"kees@...nel.org" <kees@...nel.org>, "arnd@...db.de" <arnd@...db.de>,
	"feng.tang@...ux.alibaba.com" <feng.tang@...ux.alibaba.com>,
	"pauld@...hat.com" <pauld@...hat.com>, "joel.granados@...nel.org"
	<joel.granados@...nel.org>, "linux-doc@...r.kernel.org"
	<linux-doc@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: RE: [????] Re: [PATCH][v2] hung_task: Panic after fixed number of
 hung tasks



> -----Original Message-----
> From: Li,Rongqing
> Sent: October 11, 2025 20:03
> To: 'Masami Hiramatsu' <mhiramat@...nel.org>
> Cc: corbet@....net; akpm@...ux-foundation.org; lance.yang@...ux.dev;
> paulmck@...nel.org; pawan.kumar.gupta@...ux.intel.com; mingo@...nel.org;
> dave.hansen@...ux.intel.com; rostedt@...dmis.org; kees@...nel.org;
> arnd@...db.de; feng.tang@...ux.alibaba.com; pauld@...hat.com;
> joel.granados@...nel.org; linux-doc@...r.kernel.org;
> linux-kernel@...r.kernel.org
> Subject: RE: [????] Re: [PATCH][v2] hung_task: Panic after fixed number of
> hung tasks
> 
> 
> 
> > -----Original Message-----
> > From: Masami Hiramatsu <mhiramat@...nel.org>
> > Sent: September 29, 2025 8:48
> > To: Li,Rongqing <lirongqing@...du.com>
> > Cc: corbet@....net; akpm@...ux-foundation.org; lance.yang@...ux.dev;
> > paulmck@...nel.org; pawan.kumar.gupta@...ux.intel.com;
> > mingo@...nel.org; dave.hansen@...ux.intel.com; rostedt@...dmis.org;
> > kees@...nel.org; arnd@...db.de; feng.tang@...ux.alibaba.com;
> > pauld@...hat.com; joel.granados@...nel.org; linux-doc@...r.kernel.org;
> > linux-kernel@...r.kernel.org
> > Subject: [????] Re: [PATCH][v2] hung_task: Panic after fixed number of
> > hung tasks
> >
> > On Sun, 28 Sep 2025 13:31:37 +0800
> > lirongqing <lirongqing@...du.com> wrote:
> >
> > > From: Li RongQing <lirongqing@...du.com>
> > >
> > > Currently, when hung_task_panic is enabled, the kernel panics
> > > immediately upon detecting the first hung task. However, some hung
> > > tasks are transient and the system can recover fully, while others
> > > are unrecoverable and trigger consecutive hung task reports, in
> > > which case a panic is expected.
> > >
> > > This commit adds a new sysctl parameter, hung_task_count_to_panic,
> > > that specifies the number of consecutive hung tasks that must be
> > > detected before triggering a kernel panic. This provides finer
> > > control for environments where transient hangs may happen but
> > > persistent hangs should still be fatal.
> >
> > IIUC, perhaps there are multiple groups that require different
> > timeouts for hang checks, and you want to set the hung task timeout to
> > match the shorter one, but ignore the longer ones at that point.
> >
> > If so, this is essentially a problem with a long process that is
> > performed under TASK_UNINTERRUPTIBLE. Ideally, the progress of such
> > a process should be checked periodically and the hang check should
> > be reset unless it is really blocked.
> > But this is not currently implemented. (For example, depending on the
> > media, it may not be possible to check whether long IO is being
> > performed.)
> >
> > The hung_task detector will even report these kinds of long
> > operations as task hang-ups. But if you set a long detection time
> > accordingly, you will also have to wait until that detection time
> > for hangs that occur in a short period of time.
> >
> > The hung tasks on one major lock can spread in a domino effect.
> > So setting a reasonably short detection time, but not panicking until
> > there are enough of them, seems like a reasonable strategy.
> > But in this case, I think we also need a "hard timeout limit"
> > for hung tasks, which will detect longer ones. You should also use
> > the peak value, not the accumulated value.
> >
> > If it is really transient (thus, it is not hung), accumulation of such
> > normal but just slow operations will still kick hung_tasks.
> >
> 
> 
> Would it be reasonable to trigger a panic only after a hung task has
> been detected in a certain number of consecutive checks?
> 
> Something like:
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index d17cd3f..045bef5 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -304,6 +304,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>         int max_count = sysctl_hung_task_check_count;
>         unsigned long last_break = jiffies;
>         struct task_struct *g, *t;
> +       unsigned long pre_detect_count = sysctl_hung_task_detect_count;
> +       static unsigned long contiguous_detect_count;
> 
>         /*
>          * If the system crashed already then all bets are off,
> @@ -326,6 +328,15 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> 
>                 check_hung_task(t, timeout);
>         }
> +
> +       if (sysctl_hung_task_detect_count != pre_detect_count) {
> +               contiguous_detect_count++;
> +               if (sysctl_max_hung_task_to_panic &&
> +                               contiguous_detect_count > sysctl_max_hung_task_to_panic)
> +                       hung_task_call_panic = 1;
> +       }
> +       else
> +               contiguous_detect_count = 0;
>   unlock:
>         rcu_read_unlock();
>         if (hung_task_show_lock)
> 
> 
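
The consecutive-check idea in the quoted diff above can be modelled outside the kernel. The following is only a userspace sketch: `struct scan_state`, `scan_update()` and their fields are hypothetical names, not kernel symbols, and `newly_hung` stands for the number of hung tasks reported by one scan.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct scan_state {
	unsigned long consecutive_scans;  /* scans in a row that found a hang */
	unsigned long max_scans_to_panic; /* 0 disables the check */
};

/*
 * Record the result of one scan. newly_hung is the number of hung tasks
 * that scan reported. Returns true once more than max_scans_to_panic
 * consecutive scans have each found at least one hung task.
 */
bool scan_update(struct scan_state *s, unsigned long newly_hung)
{
	if (newly_hung == 0) {
		/* A clean scan breaks the streak. */
		s->consecutive_scans = 0;
		return false;
	}

	s->consecutive_scans++;
	return s->max_scans_to_panic &&
	       s->consecutive_scans > s->max_scans_to_panic;
}
```

As in the quoted diff, the comparison is strict, so with a limit of 2 the condition is first met on the third consecutive scan that finds a hung task.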

A single task hanging for an extended period may not be a critical issue, since users can still log in to the system to investigate. However, if multiple tasks hang simultaneously (for example, I/O hangs caused by a disk failure), users may be unable to log in at all. That is a serious problem, and a panic is expected. Therefore, the solution should be designed as follows:

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d17cd3f..52ebf18 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -304,6 +304,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
        int max_count = sysctl_hung_task_check_count;
        unsigned long last_break = jiffies;
        struct task_struct *g, *t;
+       unsigned long pre_detect_count = sysctl_hung_task_detect_count;

        /*
         * If the system crashed already then all bets are off,
@@ -326,6 +327,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)

                check_hung_task(t, timeout);
        }
+
+       if (sysctl_hung_task_detect_count - pre_detect_count > sysctl_max_hung_task_to_panic) {
+               hung_task_call_panic = 1;
+       }
  unlock:
        rcu_read_unlock();
        if (hung_task_show_lock)
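
For illustration, the per-scan check in the diff above can be sketched in userspace. The function name `too_many_new_hangs()` and its parameters are hypothetical: `before` and `after` stand for the detect counter sampled at the start and end of one scan, and `limit` plays the role of sysctl_max_hung_task_to_panic. Treating a limit of 0 as "disabled" is an assumption of this sketch; the diff as posted has no such guard.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function. Returns true when a single
 * scan reported more than `limit` new hung tasks.
 */
bool too_many_new_hangs(unsigned long before, unsigned long after,
			unsigned long limit)
{
	/* Unsigned subtraction stays correct if the counter wraps around. */
	unsigned long newly_hung = after - before;

	/* Assumption: a limit of 0 means the check is disabled. */
	return limit != 0 && newly_hung > limit;
}
```

Because the count is taken per scan rather than accumulated since boot, a slow trickle of transient hangs never reaches the limit, while a burst of simultaneous hangs does.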


-Li

> > -Li
> 
> > Thank you,
> >
> > >
> > > Acked-by: Lance Yang <lance.yang@...ux.dev>
> > > Signed-off-by: Li RongQing <lirongqing@...du.com>
> > > ---
> > > Diff with v1: change documentation as Lance suggested
> > >
> > >  Documentation/admin-guide/sysctl/kernel.rst |  8 ++++++++
> > >  kernel/hung_task.c                          | 14 +++++++++++++-
> > >  2 files changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 8b49eab..98b47a7 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -405,6 +405,14 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > >  1 Panic immediately.
> > >  = =================================================
> > >
> > > +hung_task_count_to_panic
> > > +========================
> > > +
> > > +When set to a non-zero value, a kernel panic will be triggered if
> > > +the number of detected hung tasks reaches this value.
> > > +
> > > +Note that setting hung_task_panic=1 will still cause an immediate
> > > +panic on the first hung task.
> >
> > What happens if it is 0?
> >
> > >
> > >  hung_task_check_count
> > >  =====================
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index 8708a12..87a6421 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > >  static unsigned int __read_mostly sysctl_hung_task_panic =
> > >  	IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > >
> > > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > > +
> > >  static int
> > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > >  {
> > > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > >
> > >  	trace_sched_process_hang(t);
> > >
> > > -	if (sysctl_hung_task_panic) {
> > > +	if (sysctl_hung_task_panic ||
> > > +	    (sysctl_hung_task_count_to_panic &&
> > > +	     (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> > >  		console_verbose();
> > >  		hung_task_show_lock = true;
> > >  		hung_task_call_panic = true;
> > > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > >  		.extra2		= SYSCTL_ONE,
> > >  	},
> > >  	{
> > > +		.procname	= "hung_task_count_to_panic",
> > > +		.data		= &sysctl_hung_task_count_to_panic,
> > > +		.maxlen		= sizeof(int),
> > > +		.mode		= 0644,
> > > +		.proc_handler	= proc_dointvec_minmax,
> > > +		.extra1		= SYSCTL_ZERO,
> > > +	},
> > > +	{
> > >  		.procname	= "hung_task_check_count",
> > >  		.data		= &sysctl_hung_task_check_count,
> > >  		.maxlen		= sizeof(int),
> > > --
> > > 2.9.4
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@...nel.org>