Message-ID: <20260115023229.3028462-3-atomlin@atomlin.com>
Date: Wed, 14 Jan 2026 21:32:29 -0500
From: Aaron Tomlin <atomlin@...mlin.com>
To: akpm@...ux-foundation.org,
lance.yang@...ux.dev,
mhiramat@...nel.org,
gregkh@...uxfoundation.org,
pmladek@...e.com,
joel.granados@...nel.org
Cc: sean@...e.io,
linux-kernel@...r.kernel.org
Subject: [v6 PATCH 2/2] hung_task: Enable runtime reset of hung_task_detect_count

Currently, the hung_task_detect_count sysctl provides a cumulative count
of hung tasks since boot. In long-running, high-availability
environments, this counter may lose its utility if it cannot be reset
once an incident has been resolved. Furthermore, the current
implementation relies upon relaxed ordering, which cannot strictly
guarantee that diagnostic metadata published by one CPU is visible to
the panic logic on another.

This patch introduces the capability to reset the detection count by
writing "0" to the hung_task_detect_count sysctl. The proc_handler logic
has been updated to validate this input and atomically reset the
counter.
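
For example, once an incident has been handled, an administrator can
clear the counter; any non-zero write is rejected (values shown are
illustrative):

  # cat /proc/sys/kernel/hung_task_detect_count
  7
  # echo 0 > /proc/sys/kernel/hung_task_detect_count
  # cat /proc/sys/kernel/hung_task_detect_count
  0
  # echo 5 > /proc/sys/kernel/hung_task_detect_count
  -bash: echo: write error: Invalid argument
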
The synchronisation of sysctl_hung_task_detect_count relies upon a
transactional model to ensure the integrity of the detection counter
against concurrent resets from userspace. The application of
atomic_long_read_acquire() and atomic_long_cmpxchg_release() provides
the following guarantees, illustrated by the sketch after this list:

1. Prevention of Load-Store Reordering via Acquire Semantics: By
utilising atomic_long_read_acquire() to snapshot the counter
before initiating the task traversal, we establish a strict
memory barrier. This prevents the compiler or hardware from
reordering the initial load to a point later in the scan. Without
this "acquire" barrier, a delayed load could potentially read a
"0" value resulting from a userspace reset that occurred
mid-scan. This would lead to the subsequent cmpxchg succeeding
erroneously, thereby overwriting the user's reset with stale
increment data.

2. Atomicity of the "Commit" Phase via Release Semantics: The
atomic_long_cmpxchg_release() serves as the transaction's commit
point. The "release" barrier ensures that all diagnostic
recordings and task-state observations made during the scan are
globally visible before the counter is incremented.

3. Race Condition Resolution: This pairing effectively detects any
"out-of-band" reset of the counter. If
sysctl_hung_task_detect_count is modified via the procfs
interface during the scan, the final cmpxchg will detect the
discrepancy between the current value and the "acquire" snapshot.
Consequently, the update will fail, ensuring that a reset command
from the administrator is prioritised over a scan that may have
been invalidated by that very reset.
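
As a minimal illustration of the pattern, here is a user-space sketch
using C11 atomics in place of the kernel primitives. It is a model
only: the names (detect_count, scan_and_commit, reset_count) are
illustrative, and the C11 operations merely approximate the kernel's
atomic_long_*() semantics.

  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_long detect_count; /* models sysctl_hung_task_detect_count */

  /* Detector side: snapshot with acquire, scan, then try to commit. */
  static void scan_and_commit(long hung_this_round)
  {
          /* Acquire: the snapshot is ordered before the scan's loads. */
          long snap = atomic_load_explicit(&detect_count,
                                           memory_order_acquire);

          /* ... task scan and diagnostic recording would happen here ... */

          if (!hung_this_round)
                  return;

          /*
           * Release: everything recorded during the scan is visible
           * before the new count. If userspace reset the counter
           * mid-scan, snap no longer matches and the round is dropped.
           */
          if (!atomic_compare_exchange_strong_explicit(&detect_count,
                          &snap, snap + hung_this_round,
                          memory_order_release, memory_order_relaxed))
                  printf("counter reset mid-scan; round discarded\n");
  }

  /* Userspace side: the sysctl write path simply stores zero. */
  static void reset_count(void)
  {
          /* Plain store; atomic_long_set() is likewise unordered. */
          atomic_store_explicit(&detect_count, 0, memory_order_relaxed);
  }

  int main(void)
  {
          atomic_store(&detect_count, 3);
          scan_and_commit(2);     /* commits: count becomes 5 */
          reset_count();          /* counter back to 0 */
          printf("count=%ld\n", atomic_load(&detect_count));
          return 0;
  }

A failed cmpxchg in this model is exactly the "reset wins" outcome
described in point 3.
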
Signed-off-by: Aaron Tomlin <atomlin@...mlin.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 3 +-
kernel/hung_task.c | 109 +++++++++++++-------
2 files changed, 75 insertions(+), 37 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 239da22c4e28..68da4235225a 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -418,7 +418,8 @@ hung_task_detect_count
======================
Indicates the total number of tasks that have been detected as hung since
-the system boot.
+the system boot or since the counter was reset. The counter is zeroed when
+a value of 0 is written.
This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index b5ad7a755eb5..2eb9c861bdcc 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -224,24 +224,43 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
}
#endif
-static void check_hung_task(struct task_struct *t, unsigned long timeout,
- unsigned long prev_detect_count)
+/**
+ * hung_task_diagnostics - Print structured diagnostic info for a hung task.
+ * @t: Pointer to the detected hung task.
+ *
+ * This function consolidates the printing of core diagnostic information
+ * for a task found to be blocked.
+ */
+static inline void hung_task_diagnostics(struct task_struct *t)
{
- unsigned long total_hung_task, cur_detect_count;
-
- if (!task_is_hung(t, timeout))
- return;
-
- /*
- * This counter tracks the total number of tasks detected as hung
- * since boot.
- */
- cur_detect_count = atomic_long_inc_return_relaxed(&sysctl_hung_task_detect_count);
- total_hung_task = cur_detect_count - prev_detect_count;
+ unsigned long blocked_secs = (jiffies - t->last_switch_time) / HZ;
+
+ pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
+ t->comm, t->pid, blocked_secs);
+ pr_err(" %s %s %.*s\n",
+ print_tainted(), init_utsname()->release,
+ (int)strcspn(init_utsname()->version, " "),
+ init_utsname()->version);
+ if (t->flags & PF_POSTCOREDUMP)
+ pr_err(" Blocked by coredump.\n");
+ pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\" disables this message.\n");
+}
+/**
+ * hung_task_info - Print diagnostic details for a hung task
+ * @t: Pointer to the detected hung task.
+ * @timeout: Timeout threshold for detecting hung tasks
+ * @this_round_count: Count of hung tasks detected in the current iteration
+ *
+ * Print structured information about the specified hung task, if warnings
+ * are enabled or if the panic batch threshold is exceeded.
+ */
+static void hung_task_info(struct task_struct *t, unsigned long timeout,
+ unsigned long this_round_count)
+{
trace_sched_process_hang(t);
- if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
+ if (sysctl_hung_task_panic && this_round_count >= sysctl_hung_task_panic) {
console_verbose();
hung_task_call_panic = true;
}
@@ -251,18 +270,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
* complain:
*/
if (sysctl_hung_task_warnings || hung_task_call_panic) {
- if (sysctl_hung_task_warnings > 0)
- sysctl_hung_task_warnings--;
- pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
- t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
- pr_err(" %s %s %.*s\n",
- print_tainted(), init_utsname()->release,
- (int)strcspn(init_utsname()->version, " "),
- init_utsname()->version);
- if (t->flags & PF_POSTCOREDUMP)
- pr_err(" Blocked by coredump.\n");
- pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
- " disables this message.\n");
+ hung_task_diagnostics(t);
sched_show_task(t);
debug_show_blocker(t, timeout);
@@ -306,11 +314,14 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
int max_count = sysctl_hung_task_check_count;
unsigned long last_break = jiffies;
struct task_struct *g, *t;
- unsigned long prev_detect_count;
+ unsigned long total_count, this_round_count;
int need_warning = sysctl_hung_task_warnings;
unsigned long si_mask = hung_task_si_mask;
- prev_detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
+ /* The counter might get reset mid-scan, so remember the initial value.
+ * Acquire orders this load before the task checks that follow.
+ */
+ total_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
/*
* If the system crashed already then all bets are off,
* do not report extra hung tasks:
@@ -318,7 +329,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
if (test_taint(TAINT_DIE) || did_panic)
return;
-
+ this_round_count = 0;
rcu_read_lock();
for_each_process_thread(g, t) {
@@ -330,15 +341,26 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
last_break = jiffies;
}
- check_hung_task(t, timeout, prev_detect_count);
+ if (task_is_hung(t, timeout)) {
+ this_round_count++;
+ hung_task_info(t, timeout, this_round_count);
+ }
}
unlock:
rcu_read_unlock();
- if (!(atomic_long_read(&sysctl_hung_task_detect_count) -
- prev_detect_count))
+ if (!this_round_count)
return;
+ /*
+ * Do not count this round if the global counter was reset during
+ * this check. Release ensures all hang details recorded during the
+ * scan are visible before the updated count is published.
+ */
+ atomic_long_cmpxchg_release(&sysctl_hung_task_detect_count,
+ total_count,
+ total_count + this_round_count);
+
if (need_warning || hung_task_call_panic) {
si_mask |= SYS_INFO_LOCKS;
@@ -370,20 +392,35 @@ static long hung_timeout_jiffies(unsigned long last_checked,
* @lenp: Pointer to the length of the data being transferred
* @ppos: Pointer to the current file offset
*
- * This handler is used for reading the current hung task detection count.
- * Returns 0 on success or a negative error code on failure.
+ * This handler is used for reading the current hung task detection count
+ * and for resetting it to zero. A write is accepted only if the value
+ * written is zero; any other value is rejected with -EINVAL. Returns 0
+ * on success or a negative error code on failure.
*/
static int proc_dohung_task_detect_count(const struct ctl_table *table, int dir,
void *buffer, size_t *lenp, loff_t *ppos)
{
unsigned long detect_count;
struct ctl_table proxy_table;
+ int err;
- detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
proxy_table = *table;
proxy_table.data = &detect_count;
- return proc_doulongvec_minmax(&proxy_table, dir, buffer, lenp, ppos);
+ if (SYSCTL_KERN_TO_USER(dir))
+ detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
+
+ err = proc_doulongvec_minmax(&proxy_table, dir, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+
+ if (SYSCTL_USER_TO_KERN(dir)) {
+ if (detect_count)
+ return -EINVAL;
+ atomic_long_set(&sysctl_hung_task_detect_count, 0);
+ }
+
+ return 0;
}
/*
--
2.51.0