Message-ID: <rcpyt6ehul465cchdm6tun4diwfrs4pr2fvvk7zpmc4h6ycrrg@djcftft7r443>
Date: Mon, 5 Jan 2026 09:42:08 -0500
From: Aaron Tomlin <atomlin@...mlin.com>
To: Joel Granados <joel.granados@...nel.org>
Cc: akpm@...ux-foundation.org, lance.yang@...ux.dev, mhiramat@...nel.org, 
	gregkh@...uxfoundation.org, pmladek@...e.com, neelx@...e.com, sean@...e.io, 
	linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH 2/2] hung_task: Enable runtime reset of
 hung_task_detect_count

On Mon, Jan 05, 2026 at 11:53:08AM +0100, Joel Granados wrote:
> It is clear to me that you need a custom handler after changing the type
> to an atomic_long_t.
> 
> It is not clear to my why you changed it to atomic_long_t. It was
> already being modified,read,written when it was a raw unsigned long.
> What has changed that requires atomic_long_t?

Hi Joel,

Thank you very much for your review and for raising this important
question. I appreciate the opportunity to clarify the reasoning behind the
type change.

The rationale for switching to atomic_long_t is primarily the shift from a
single-writer model to a multi-writer model, alongside the need for strict
memory ordering, which plain unsigned long operations cannot guarantee.

1. Atomicity (precluding the "Lost Update" scenario)

   In the existing implementation, sysctl_hung_task_detect_count is
   effectively read-only for userspace. There is only one writer: the
   watchdog thread. By introducing the ability for an administrator to
   write '0' (reset), we create a race condition between:

    - The watchdog thread attempting to increment the counter
      (detect_count++)

    - The administrator attempting to reset the counter (detect_count = 0)

On many architectures, a standard increment is a non-atomic
Read-Modify-Write sequence. Without atomic_long_inc(), we would risk a
"Lost Update" scenario where a reset occurs precisely between the
watchdog's load and store instructions, causing the counter to revert to a
high value immediately after the reset.

2. Memory Ordering (Acquire/Release Semantics)

   Beyond basic atomicity, the use of atomic_long_inc_return_release() and
   atomic_long_read_acquire() is necessary to enforce correct
   synchronisation without the overhead of full memory barriers.

    - Release semantics (in check_hung_task()): We utilise Release semantics
      when incrementing the counter. This guarantees that all memory
      operations associated with detecting the hang (the reads inside
      task_is_hung()) are globally visible before the counter is updated.
      This effectively "publishes" the hang event only after the detection
      logic is fully complete.

    - Acquire semantics (in check_hung_uninterruptible_tasks()): We utilise
      Acquire semantics when reading the counter in the summary loop. This
      ensures that we do not reorder the loop's subsequent checks to occur
      before we have observed the current counter state.

If we were to rely on _relaxed atomics or a simple unsigned long, the CPU
could theoretically reorder these operations, potentially leading to
inconsistencies where the summary loop observes a counter increment but
sees stale task state data.

I hope this explanation clarifies the necessity of these changes. Please
let me know if you have any further queries.


Kind regards,
-- 
Aaron Tomlin

