lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241027120747.42833-2-ioworker0@gmail.com>
Date: Sun, 27 Oct 2024 20:07:46 +0800
From: Lance Yang <ioworker0@...il.com>
To: akpm@...ux-foundation.org
Cc: dj456119@...il.com,
	cunhuang@...cent.com,
	leonylgao@...cent.com,
	j.granados@...sung.com,
	jsiddle@...hat.com,
	kent.overstreet@...ux.dev,
	21cnbao@...il.com,
	ryan.roberts@....com,
	david@...hat.com,
	ziy@...dia.com,
	libang.li@...group.com,
	baolin.wang@...ux.alibaba.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	joel.granados@...nel.org,
	linux@...ssschuh.net,
	Lance Yang <ioworker0@...il.com>,
	Mingzhe Yang <mingzhe.yang@...com>
Subject: [PATCH v2 1/2] hung_task: add detect count for hung tasks

This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected.

IHMO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly as
using a counter.

Sometimes, a short-lived issue with NIC or hard drive can quickly decrease
the hung_task_warnings to zero. Without warnings, we must directly access
the node to ensure that there are no more hung tasks and that the system
has recovered. After all, load average alone cannot provide a clear
picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense.

Signed-off-by: Mingzhe Yang <mingzhe.yang@...com>
Signed-off-by: Lance Yang <ioworker0@...il.com>
---
 kernel/hung_task.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 959d99583d1c..229ff3d4e501 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -30,6 +30,11 @@
  */
 static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
+/*
+ * Total number of tasks detected as hung since boot:
+ */
+static unsigned long __read_mostly sysctl_hung_task_detect_count;
+
 /*
  * Limit number of tasks checked in a batch.
  *
@@ -115,6 +120,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
 		return;
 
+	/*
+	 * This counter tracks the total number of tasks detected as hung
+	 * since boot.
+	 */
+	sysctl_hung_task_detect_count++;
+
 	trace_sched_process_hang(t);
 
 	if (sysctl_hung_task_panic) {
@@ -314,6 +325,13 @@ static struct ctl_table hung_task_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_NEG_ONE,
 	},
+	{
+		.procname	= "hung_task_detect_count",
+		.data		= &sysctl_hung_task_detect_count,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 };
 
 static void __init hung_task_sysctl_init(void)
-- 
2.45.2


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ