Message-ID: <2a5bf780-bce0-4642-b375-f464add671d6@linux.dev>
Date: Fri, 19 Dec 2025 22:15:14 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: Petr Mladek <pmladek@...e.com>, Aaron Tomlin <atomlin@...mlin.com>
Cc: sean@...e.io, linux-kernel@...r.kernel.org, mhiramat@...nel.org,
 akpm@...ux-foundation.org, gregkh@...uxfoundation.org
Subject: Re: [PATCH v3 2/2] hung_task: Enable runtime reset of
 hung_task_detect_count



On 2025/12/19 20:03, Petr Mladek wrote:
> On Thu 2025-12-18 22:09:57, Aaron Tomlin wrote:
>> On Wed, Dec 17, 2025 at 02:09:14PM +0100, Petr Mladek wrote:
>>> The counter is used to decide how many hung tasks were found
>>> by a single check, see:
>>>
>>> Any race might cause false positives and even panic() !!!
>>> And the race window is rather big (checking all processes).
>>>
>>> This race can't be prevented by an atomic read/write.
>>> IMHO, the only solution would be to add some locking,
>>> e.g. mutex and take it around the entire
>>> check_hung_uninterruptible_tasks().
> 
> [...]
> 
>> Thank you for the feedback regarding the potential for false positives and
>> erroneous panics. You are absolutely correct that the race window is
>> substantial, given that it spans the entire duration of a full
>> process-thread iteration.
>>
>> I believe we can resolve this race condition and prevent the integer
>> underflow without the overhead or risk of a mutex by leveraging hardware
>> atomicity and a slight adjustment to the delta logic.
>>
>> Specifically, by converting sysctl_hung_task_detect_count to an
>> atomic_long_t, we ensure that the increment and reset operations are
>> indivisible at the CPU level. To handle the "reset mid-scan" issue you
>> highlighted, we can modify the delta calculation to be "reset-aware":
> 
>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>> index 01ce46a107b0..74725dc1efd0 100644
>> --- a/kernel/hung_task.c
>> +++ b/kernel/hung_task.c
>> @@ -17,6 +17,7 @@
>>   #include <linux/export.h>
>>   #include <linux/panic_notifier.h>
>>   #include <linux/sysctl.h>
>> +#include <linux/atomic.h>
>>   #include <linux/suspend.h>
>>   #include <linux/utsname.h>
>>   #include <linux/sched/signal.h>
>> @@ -36,7 +37,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
>>   /*
>>    * Total number of tasks detected as hung since boot:
>>    */
>> -static unsigned long __read_mostly sysctl_hung_task_detect_count;
>> +static atomic_long_t atomic_long_t sysctl_hung_task_detect_count = ATOMIC_LONG_INIT(0);

Double atomic_long_t here.

>>   
>>   /*
>>    * Limit number of tasks checked in a batch.
>> @@ -253,6 +254,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
>>   		unsigned long prev_detect_count)
>>   {
>>   	unsigned long total_hung_task;
>> +	unsigned long current_detect;
>>   
>>   	if (!task_is_hung(t, timeout))
>>   		return;
>> @@ -261,9 +263,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
>>   	 * This counter tracks the total number of tasks detected as hung
>>   	 * since boot.
>>   	 */
>> -	sysctl_hung_task_detect_count++;
>> +	atomic_long_inc(&sysctl_hung_task_detect_count);
>> +
>> +	current_detect = atomic_long_read(&sysctl_hung_task_detect_count);
> 
> The two atomic_long_inc() and atomic_long_read() can be replaced
> by atomic_long_inc_return().

+1, cleaner and potentially cheaper on some archs :)
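
Something like this, I guess (untested):

	/* One atomic RMW instead of a separate inc + read. */
	current_detect = atomic_long_inc_return(&sysctl_hung_task_detect_count);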

> 
>> +	if (current_detect >= prev_detect_count)
>> +		total_hung_task = current_detect - prev_detect_count;
>> +	else
>> +		total_hung_task = current_detect;
> 
> Interesting trick. The result may not be precise when the counter
> is reset in the middle of the check. But it would prevent
> calling panic() because of a wrong computation.

Clever! If it's a real persistent hang, we'll catch it on the next
scan anyway.

But yeah, please add a comment explaining that we're OK with
undercounting one scan if the counter gets reset mid-flight.

Future readers will thank you ;)
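
Maybe something along these lines (feel free to reword):

	/*
	 * The detect counter can be reset from userspace while a scan is
	 * in flight, so the snapshot taken at scan start may be larger
	 * than the current value. Fall back to the raw counter in that
	 * case so the subtraction never underflows and trips a bogus
	 * panic(). Worst case we undercount this one scan; a persistent
	 * hang is caught by the next scan anyway.
	 */
	if (current_detect >= prev_detect_count)
		total_hung_task = current_detect - prev_detect_count;
	else
		total_hung_task = current_detect;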

> 
> It should be acceptable. The panic() is needed when the lockup looks
> permanent. And if the lockup is permanent then the right
> total_hung_task number will be computed by the next check.
> 
> Please, add a comment into the code explaining why it is done this way
> and why it is OK.
> 
>> -	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>>   	trace_sched_process_hang(t);
>>   
>>   	if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
>> @@ -322,7 +329,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>>   	int max_count = sysctl_hung_task_check_count;
>>   	unsigned long last_break = jiffies;
>>   	struct task_struct *g, *t;
>> -	unsigned long prev_detect_count = sysctl_hung_task_detect_count;
>> +	unsigned long prev_detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
> 
> I think that it is not super important in this case. But this is a kind of
> locking, so I would use some proper annotation/barriers. IMHO, we
> should use here:

Right, memory ordering matters!
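
In the code that would look roughly like this, I guess (untested):

	/* Snapshot at scan start (acquire)... */
	prev_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);

	/* ...paired with the release on the increment in check_hung_task(). */
	current_detect = atomic_long_inc_return_release(&sysctl_hung_task_detect_count);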

> 
> 	atomic_long_read_acquire()
> 
> And the counter part would be
> 
> 	atomic_long_inc_return_release()
> 
>>   	int need_warning = sysctl_hung_task_warnings;
>>   	unsigned long si_mask = hung_task_si_mask;
>>   
>> @@ -350,7 +357,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>>    unlock:
>>   	rcu_read_unlock();
>>   
>> -	if (!(sysctl_hung_task_detect_count - prev_detect_count))
>> +	if (!(atomic_long_read(&sysctl_hung_task_detect_count) - prev_detect_count))
> 
> I guess that we should do the same check here. Otherwise, we might
> print the sys_info() just because the counter has been reset
> in the meantime.

Good catch. Same underflow check should apply here.
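
Untested sketch, modulo whichever read variant we settle on (names
are just placeholders):

	unsigned long cur_detect = atomic_long_read(&sysctl_hung_task_detect_count);
	unsigned long hung_this_scan;

	/* Same reset-aware delta as in check_hung_task(). */
	hung_this_scan = (cur_detect >= prev_detect_count) ?
			 cur_detect - prev_detect_count : cur_detect;
	if (!hung_this_scan)
		return;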

> 
> And from the locking/barriers POV, we should use atomic_long_read_release() here.
> 
>>   		return;
>>   
>>   	if (need_warning || hung_task_call_panic) {
>> @@ -391,10 +398,15 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>>   static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
>>   					 void *buffer, size_t *lenp, loff_t *ppos)
>>   {
>> -	if (!write)
>> -		return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>> +	if (!write) {
>> +		unsigned long count = atomic_long_read(&sysctl_hung_task_detect_count);
>> +		struct ctl_table proxy_table = *table;
>> +
>> +		proxy_table.data = &count;
>> +		return proc_doulongvec_minmax(&proxy_table, write, buffer, lenp, ppos);
> 
> If we are going to use the proxy table then we are already half way
> to add the check of written value. Please, return -EINVAL when user
> writes a non-zero value as I suggested at
> https://lore.kernel.org/r/aUKmh0Gs8YH_iLbC@pathway.suse.cz

Explicit is better. Echo 0 to reset, anything else is an error :)
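
Rough idea of how the write side could reuse the proxy table, just
to illustrate (untested):

	if (write) {
		struct ctl_table proxy_table = *table;
		unsigned long val = 0;
		int ret;

		proxy_table.data = &val;
		ret = proc_doulongvec_minmax(&proxy_table, write, buffer, lenp, ppos);
		if (ret)
			return ret;

		/* Only "0" is accepted; it resets the counter. */
		if (val)
			return -EINVAL;

		atomic_long_set(&sysctl_hung_task_detect_count, 0);
		return 0;
	}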

> 
>> +	}
>>   
>> -	WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>> +	atomic_long_set(&sysctl_hung_task_detect_count, 0);
>>   	*ppos += *lenp;
>>   
>>   	return 0;
>>
>> Please note that this updated approach has not been tested yet, but I
>> believe it addresses the core concerns while maintaining the architectural
>> integrity of the detector.
> 
> This approach looks acceptable to me.

Yeah, I think this is the way to go. Much better than adding a
lock to the hung task detector ...

> 
> I also thought about using sequence counters, see
> Documentation/locking/seqlock.rst. It would allow restarting the scan
> when the counter has been reset in the meantime. But I think that it
> is not worth it. The counter is important only to decide whether
> to call panic() or not. And it makes sense to call panic() only
> when the stall is persistent. So, it is perfectly fine to wait
> for the next scan.
> 
> Best Regards,
> Petr

