[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aBxE6jXwjIDRdr1z@U-2FWC9VHC-2323.local>
Date: Thu, 8 May 2025 13:45:14 +0800
From: Feng Tang <feng.tang@...ux.alibaba.com>
To: Lance Yang <lance.yang@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Petr Mladek <pmladek@...e.com>,
Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 2/3] kernel/hung_task: add option to dump system info
when hung task detected
Hi Lance,
Many thanks for the review!
On Thu, May 08, 2025 at 11:02:22AM +0800, Lance Yang wrote:
> Hi Feng,
>
> Thanks for the patch series!
>
> On 2025/5/7 18:43, Feng Tang wrote:
> > Kernel panic code utilizes sys_show_info() to dump needed system
> > information to help debugging. Similarly, add this debug option for
> > task hung case, and 'hungtask_print' is the knob to control what
> > information should be printed out.
> >
> > Also clean up the code about dumping locks and triggering backtrace
> > for all CPUs. One todo may be to merge this 'hungtask_print' with
> > some sysctl knobs in hung_task.c.
> >
> > Signed-off-by: Feng Tang <feng.tang@...ux.alibaba.com>
> > ---
> > kernel/hung_task.c | 29 ++++++++++++++++-------------
> > 1 file changed, 16 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index dc898ec93463..8229637be2c7 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -58,12 +58,20 @@ static unsigned long __read_mostly sysctl_hung_task_check_interval_secs;
> > static int __read_mostly sysctl_hung_task_warnings = 10;
> > static int __read_mostly did_panic;
> > -static bool hung_task_show_lock;
> > static bool hung_task_call_panic;
> > -static bool hung_task_show_all_bt;
> > static struct task_struct *watchdog_task;
> > +/*
> > + * A bitmask to control what kinds of system info to be printed when a
> > + * hung task is detected, it could be task, memory, lock etc. Refer panic.h
> > + * for details of bit definition.
> > + */
> > +unsigned long hungtask_print;
> > +core_param(hungtask_print, hungtask_print, ulong, 0644);
>
> how about lockup_debug_print_mask?
The 3/3 patch has a 'lockup_print' as it is for soft/hard lockup :).
The name follows the existing 'panic_print', and indeed it's actually
a bitmask, how about 'hung_print_mask'?
>
> It could be useful for debugging, but there are a few concerns:
>
> 1) SYS_PRINT_* vs. hung_task_* priority conflict
> - If SYS_PRINT_ALL_CPU_BT is set on the command line but
> hung_task_all_cpu_backtrace is disabled, which one wins?
> - Or should SYS_PRINT_ALL_CPU_BT force-enable hung_task_all_cpu_backtrace?
With this patch, the 'hungtask_print' and hung_task_all_cpu_backtrace
will be ORed, so yes, if user sets SYS_PRINT_ALL_CPU_BT explicitly, the
all-cpu-backtrace will be printed.
While the default value for hungtask_print is 0, and no system info will
be dumped by default.
Long term wise, I'm not sure if sysctl_hung_task_all_cpu_backtracemay
could be removed as its function can be covered by this print_mask knob.
> 2) Duplicate prints
> With SYS_PRINT_BLOCKED_TASKS enabled, processes in D state will be printed
> twice, right?
Good point. As sys_show_info() is a general API helper, the user may chose
not to set SYS_PRINT_BLOCKED_TASKS when debugging task hung.
In one recent bug we debugged with this patch, when the first "task hung" was
shown, there were already dozens of tasks were in D state, which just hadn't
hit the 120 seconds limit yet, and dumping them all helped in that case.
> Also, we really should document how those command-line parameters work ;)
Exactly! It currently just said 'refer panic.h' in code comment, maybe I
should copy those definitions here as comments. How do you think?
Thanks,
Feng
> Thansk,
> Lance
>
> > +
> > +static unsigned long cur_hungtask_print;
> > +
> > #ifdef CONFIG_SMP
> > /*
> > * Should we dump all CPUs backtraces in a hung task event?
> > @@ -163,11 +171,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > */
> > sysctl_hung_task_detect_count++;
> > + cur_hungtask_print = hungtask_print;
> > trace_sched_process_hang(t);
> > if (sysctl_hung_task_panic) {
> > console_verbose();
> > - hung_task_show_lock = true;
> > + cur_hungtask_print |= SYS_PRINT_LOCK_INFO;
> > hung_task_call_panic = true;
> > }
> > @@ -190,10 +199,10 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > " disables this message.\n");
> > sched_show_task(t);
> > debug_show_blocker(t);
> > - hung_task_show_lock = true;
> > + cur_hungtask_print |= SYS_PRINT_LOCK_INFO;
> > if (sysctl_hung_task_all_cpu_backtrace)
> > - hung_task_show_all_bt = true;
> > + cur_hungtask_print |= SYS_PRINT_ALL_CPU_BT;
> > if (!sysctl_hung_task_warnings)
> > pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> > }
> > @@ -242,7 +251,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> > if (test_taint(TAINT_DIE) || did_panic)
> > return;
> > - hung_task_show_lock = false;
> > + cur_hungtask_print = 0;
> > rcu_read_lock();
> > for_each_process_thread(g, t) {
> > unsigned int state;
> > @@ -266,14 +275,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> > }
> > unlock:
> > rcu_read_unlock();
> > - if (hung_task_show_lock)
> > - debug_show_all_locks();
> > -
> > - if (hung_task_show_all_bt) {
> > - hung_task_show_all_bt = false;
> > - trigger_all_cpu_backtrace();
> > - }
> > + sys_show_info(cur_hungtask_print);
> > if (hung_task_call_panic)
> > panic("hung_task: blocked tasks");
> > }
Powered by blists - more mailing lists