Message-ID: <3acdcd15-7e52-4a9a-9492-a434ed609dcc@linux.dev>
Date: Tue, 14 Oct 2025 18:59:07 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: lirongqing <lirongqing@...du.com>, Petr Mladek <pmladek@...e.com>
Cc: wireguard@...ts.zx2c4.com, linux-arm-kernel@...ts.infradead.org,
"Liam R . Howlett" <Liam.Howlett@...cle.com>, linux-doc@...r.kernel.org,
David Hildenbrand <david@...hat.com>, Randy Dunlap <rdunlap@...radead.org>,
Stanislav Fomichev <sdf@...ichev.me>, linux-aspeed@...ts.ozlabs.org,
Andrew Jeffery <andrew@...econstruct.com.au>, Joel Stanley <joel@....id.au>,
Russell King <linux@...linux.org.uk>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Shuah Khan <shuah@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Jonathan Corbet <corbet@....net>,
Joel Granados <joel.granados@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, Phil Auld <pauld@...hat.com>,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
Masami Hiramatsu <mhiramat@...nel.org>, Jakub Kicinski <kuba@...nel.org>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Simon Horman <horms@...nel.org>,
Anshuman Khandual <anshuman.khandual@....com>,
Florian Westphal <fw@...len.de>, netdev@...r.kernel.org,
Kees Cook <kees@...nel.org>, Arnd Bergmann <arnd@...db.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Feng Tang <feng.tang@...ux.alibaba.com>,
"Jason A . Donenfeld" <Jason@...c4.com>
Subject: Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
On 2025/10/14 17:45, Petr Mladek wrote:
> On Tue 2025-10-14 13:23:58, Lance Yang wrote:
>> Thanks for the patch!
>>
>> I noticed the implementation panics only when N tasks are detected
>> within a single scan, because total_hung_task is reset for each
>> check_hung_uninterruptible_tasks() run.
>
> Great catch!
>
> Does it make sense?
> Is it the intended behavior, please?
>
>> So some suggestions to align the documentation with the code's
>> behavior below :)
>
>> On 2025/10/12 19:50, lirongqing wrote:
>>> From: Li RongQing <lirongqing@...du.com>
>>>
>>> Currently, when 'hung_task_panic' is enabled, the kernel panics
>>> immediately upon detecting the first hung task. However, some hung
>>> tasks are transient and the system can recover, while others are
>>> persistent and may accumulate progressively.
>
> My understanding is that this patch wanted to do:
>
> + report even temporary stalls
> + panic only when the stall was much longer and likely persistent
>
> Which might make some sense. But the code does something else.
Cool. Sounds good to me!
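To make the per-scan reset concrete, here is a minimal userspace model of the quoted v3 hunk (an illustration, not kernel code). It assumes prev_detect_count is snapshotted at the start of every check_hung_uninterruptible_tasks() pass, which the quoted hunk does not show; v3_scan() is a hypothetical helper that stands in for one such pass.

```c
#include <stdbool.h>

/*
 * Userspace model of the v3 logic, not kernel code.
 * ASSUMPTION: prev_detect_count is snapshotted at the start of every
 * check_hung_uninterruptible_tasks() pass (not visible in the hunk).
 */
static unsigned long sysctl_hung_task_detect_count;
static unsigned long sysctl_hung_task_panic; /* N in v3; 0 disables panic */

/* Hypothetical helper: one scan finding 'hung_in_scan' hung tasks.
 * Returns true if the v3 logic would set hung_task_call_panic. */
static bool v3_scan(unsigned int hung_in_scan)
{
	/* The per-scan snapshot is what makes the delta restart at zero. */
	unsigned long prev_detect_count = sysctl_hung_task_detect_count;
	bool call_panic = false;

	for (unsigned int i = 0; i < hung_in_scan; i++) {
		unsigned long total_hung_task;

		/* Mirrors what check_hung_task() does for each hung task. */
		sysctl_hung_task_detect_count++;
		total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
		if (sysctl_hung_task_panic &&
		    total_hung_task >= sysctl_hung_task_panic)
			call_panic = true;
	}
	return call_panic;
}
```

With N = 3, ten scans that each find two hung tasks accumulate 20 detections without ever panicking, while a single scan that finds three hung tasks does panic. So the threshold counts tasks within one scan, not how long the hang persists.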
>
>>> --- a/kernel/hung_task.c
>>> +++ b/kernel/hung_task.c
>>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>>>  	 */
>>>  	sysctl_hung_task_detect_count++;
>>> +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>>>  	trace_sched_process_hang(t);
>>> -	if (sysctl_hung_task_panic) {
>>> +	if (sysctl_hung_task_panic &&
>>> +	    (total_hung_task >= sysctl_hung_task_panic)) {
>>>  		console_verbose();
>>>  		hung_task_show_lock = true;
>>>  		hung_task_call_panic = true;
>
> I would expect that this patch added another counter, similar to
> sysctl_hung_task_detect_count. It would be incremented only
> once per check when a hung task was detected. And it would
> be cleared (reset) when no hung task was found.
Much cleaner. We could add an internal counter for that, yeah. No need
to expose it to userspace ;)
Petr's suggestion seems to align better with the goal of panicking on
persistent hangs, IMHO. Panic after N consecutive checks with hung tasks.
@RongQing does that work for you?
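A minimal sketch of the consecutive-scan idea, again modelled in userspace rather than written against the real kernel/hung_task.c. The names hung_task_consecutive_scans and end_of_scan() are hypothetical; N is taken from sysctl_hung_task_panic as in v3, and the sketch assumes end_of_scan() is called once at the end of each check_hung_uninterruptible_tasks() pass with the number of hung tasks that pass found.

```c
#include <stdbool.h>

/*
 * Userspace sketch of the suggested semantics, not kernel code.
 * hung_task_consecutive_scans and end_of_scan() are hypothetical names.
 */
static unsigned long sysctl_hung_task_panic;      /* N; 0 disables panic */
static unsigned long hung_task_consecutive_scans; /* internal, not a sysctl */

/*
 * ASSUMPTION: called once at the end of each scan with the number of
 * hung tasks that scan found. Returns true if the kernel should panic.
 */
static bool end_of_scan(unsigned int hung_in_scan)
{
	if (hung_in_scan == 0) {
		/* A clean scan means the stall was transient: start over. */
		hung_task_consecutive_scans = 0;
		return false;
	}
	/* Count scans, not tasks: one increment per scan with any hung task. */
	hung_task_consecutive_scans++;
	return sysctl_hung_task_panic &&
	       hung_task_consecutive_scans >= sysctl_hung_task_panic;
}
```

With N = 3, a scan that finds five hung tasks only moves the counter to 1, and a single clean scan resets it, so only three back-to-back scans that each find a hung task lead to a panic. Keeping the counter file-local matches the point that it does not need to be exposed to userspace.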