Message-ID: <6e1b7a63-967b-4004-ba14-ad46f1b3b271@linux.dev>
Date: Mon, 12 Jan 2026 22:43:35 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: Petr Mladek <pmladek@...e.com>, Aaron Tomlin <atomlin@...mlin.com>
Cc: akpm@...ux-foundation.org, mhiramat@...nel.org,
gregkh@...uxfoundation.org, joel.granados@...nel.org, sean@...e.io,
linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH 2/2] hung_task: Enable runtime reset of
hung_task_detect_count
On 2026/1/12 21:13, Petr Mladek wrote:
> On Fri 2026-01-09 21:50:20, Lance Yang wrote:
>>
>>
>> On 2026/1/8 22:41, Petr Mladek wrote:
>>> On Tue 2025-12-30 19:41:25, Aaron Tomlin wrote:
>>>> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>>>>
>>>> Writing a value of zero to this file atomically resets the counter of
>>>> detected hung tasks. This grants system administrators the ability to
>>>> clear the cumulative diagnostic history after resolving an incident,
>>>> simplifying monitoring without requiring a system restart.
>>>
>>>> --- a/kernel/hung_task.c
>>>> +++ b/kernel/hung_task.c
>>>> @@ -346,7 +355,14 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>>>> unlock:
>>>> rcu_read_unlock();
>>>> - if (!(sysctl_hung_task_detect_count - prev_detect_count))
>>>> + /* Ensures we see all hang details recorded during the scan. */
>>>> + cur_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
>>>
>>> This value is read at the end of the scan => _release
>>> semantic/barrier should be here.
>>
>> Seems like _acquire is still correct here, because it is a load.
>>
>> _release semantics apply to stores, while _acquire on a load
>> ensures subsequent memory accesses are not reordered before it.
>
> Right!
>
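To spell out the load/store distinction, here is a quick userspace C11
analogue (purely illustrative; the names are mine and this is not the
kernel code):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long flag;
static long payload;	/* plain data published via the flag */

static void writer(void)
{
	payload = 42;
	/* release: the payload store cannot be reordered after this store */
	atomic_store_explicit(&flag, 1, memory_order_release);
}

static bool reader(long *out)
{
	/* acquire: later loads cannot be reordered before this load */
	if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
		*out = payload;	/* guaranteed to observe 42 */
		return true;
	}
	return false;
}
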
>> Or smp_mb()?
>>
>> In the same thread, atomic operations on the same variable are not
>> reordered with respect to each other; even the _relaxed variant
>> preserves program order for that variable, IIRC.
>>
>> So the increment will always complete before the final read in
>> program order, and the read will see the updated value (unless
>> another CPU resets it concurrently, which is a logical race, not
>> a reordering issue).
>>
>> So, it would be:
>>
>> prev = atomic_long_read_acquire(&counter); // scan start
>> ...
>> cur = atomic_long_inc_return_relaxed(&counter); // during scan
>> ...
>> cur = atomic_long_read_acquire(&counter); // scan end
>>
>> The first _acquire ensures no task-checking code is reordered
>> before the start read, the middle increment is just atomic
>> without extra barriers, and the final _acquire makes sure we
>> observe all hang details before computing the delta.
>
> The acquire/relaxed/acquire semantic looks weird. The problem is
> that we do not use the counter as a lock.
>
> I thought about a sane approach and the following came to my
> mind:
>
> From c28b74c35d653f527aa9017c32630ad08180fb4e Mon Sep 17 00:00:00 2001
> From: Petr Mladek <pmladek@...e.com>
> Date: Mon, 12 Jan 2026 14:00:52 +0100
> Subject: [POC] hung_task: Update the global counter using a proper
> acquire/release semantic
>
> The global counter of hung tasks might get reset when the check
> is in progress. Also the number of hung tasks detected in the current
> round is important to decide whether panic() is needed or not.
>
> Handle races by:
>
> 1. Remember the total counter at the beginning of the check.
> 2. Count the current round in a local variable.
> 3. Update the total counter only when the value has not been modified
> during the check.
Cool!
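
If I follow the three steps correctly, the flow boils down to roughly
(a sketch only, using the helpers from the patch below):

total = atomic_long_read_acquire(&counter);	// 1. snapshot at scan start
this_round = ...;				// 2. count hung tasks in a local variable
atomic_long_cmpxchg_release(&counter, total,
			    total + this_round);	// 3. publish only if unchanged
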
>
> Note that this is only compile tested.
>
> Signed-off-by: Petr Mladek <pmladek@...e.com>
> ---
> kernel/hung_task.c | 53 +++++++++++++++++-----------------------------
> 1 file changed, 20 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 3bc72a4e4032..c939cd3d8a2c 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -246,30 +246,12 @@ static inline void hung_task_diagnostics(struct task_struct *t)
> pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\" disables this message.\n");
> }
>
> -static void check_hung_task(struct task_struct *t, unsigned long timeout,
> - unsigned long prev_detect_count)
> +static void hung_task_info(struct task_struct *t, unsigned long timeout,
> + unsigned long this_round_count)
> {
> - unsigned long total_hung_task, cur_detect_count;
> -
> - if (!task_is_hung(t, timeout))
> - return;
> -
> - /*
> - * This counter tracks the total number of tasks detected as hung
> - * since boot. If a reset occurred during the scan, we treat the
> - * current count as the new delta to avoid an underflow error.
> - * Ensure hang details are globally visible before the counter
> - * update.
> - */
> - cur_detect_count = atomic_long_inc_return_release(&sysctl_hung_task_detect_count);
> - if (cur_detect_count >= prev_detect_count)
> - total_hung_task = cur_detect_count - prev_detect_count;
> - else
> - total_hung_task = cur_detect_count;
> -
> trace_sched_process_hang(t);
>
> - if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
> + if (sysctl_hung_task_panic && this_round_count >= sysctl_hung_task_panic) {
> console_verbose();
> hung_task_call_panic = true;
> }
> @@ -325,12 +307,13 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> int max_count = sysctl_hung_task_check_count;
> unsigned long last_break = jiffies;
> struct task_struct *g, *t;
> - unsigned long cur_detect_count, prev_detect_count, delta;
> + unsigned long total_count, this_round_count;
> int need_warning = sysctl_hung_task_warnings;
> unsigned long si_mask = hung_task_si_mask;
>
> - /* Acquire prevents reordering task checks before this point. */
> - prev_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
> + /* The counter might get reset. Remember the initial value. */
> + total_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
> +
> /*
> * If the system crashed already then all bets are off,
> * do not report extra hung tasks:
> @@ -339,6 +322,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> return;
>
>
> + this_round_count = 0UL;
> rcu_read_lock();
> for_each_process_thread(g, t) {
>
> @@ -350,21 +334,24 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> last_break = jiffies;
> }
>
> - check_hung_task(t, timeout, prev_detect_count);
> + if (task_is_hung(t, timeout)) {
> + this_round_count++;
> + hung_task_info(t, timeout, this_round_count);
> + }
> }
> unlock:
> rcu_read_unlock();
>
> - /* Ensures we see all hang details recorded during the scan. */
> - cur_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
> - if (cur_detect_count < prev_detect_count)
> - delta = cur_detect_count;
> - else
> - delta = cur_detect_count - prev_detect_count;
> -
> - if (!delta)
> + if (!this_round_count)
> return;
>
> + /*
> + * Do not count this round when the global counter has been reset
> + * during this check.
> + */
> + atomic_long_cmpxchg_release(&sysctl_hung_task_detect_count, total_count,
> + total_count + this_round_count);
> +
> if (need_warning || hung_task_call_panic) {
> si_mask |= SYS_INFO_LOCKS;
>
In general, the POC makes a lot of sense ;)
The acquire/release pairing is now straightforward:
- _read_acquire at start
- local counting (no atomics)
- _cmpxchg_release at end
And yeah, if cmpxchg fails, we simply do not count this round,
which is reasonable behavior.
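
Just to convince myself, here is a minimal userspace C11 analogue of the
pattern (my own toy code, not the kernel implementation), showing that a
concurrent reset makes the cmpxchg fail and the round is simply dropped:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic long detect_count;

/* One scan: snapshot, count locally, publish only if nobody reset us. */
static void scan(long hung_this_round, int racing_reset)
{
	long total = atomic_load_explicit(&detect_count, memory_order_acquire);

	if (racing_reset)	/* simulate a write of 0 to the sysctl mid-scan */
		atomic_store_explicit(&detect_count, 0, memory_order_release);

	/* Fails if the counter changed under us, so this round is not counted. */
	atomic_compare_exchange_strong_explicit(&detect_count, &total,
						total + hung_this_round,
						memory_order_release,
						memory_order_relaxed);
}

int main(void)
{
	scan(3, 0);	/* no race: counter becomes 3 */
	scan(2, 1);	/* reset during scan: cmpxchg fails, round dropped */
	printf("detect_count = %ld\n",
	       atomic_load_explicit(&detect_count, memory_order_acquire));
	return 0;	/* prints 0: the reset won */
}
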
So, Aaron, can you take care of refactoring/testing based on
Petr's approach?
Cheers,
Lance