Message-Id: <a732cb65347843e8b5fdbe182363c5438ac0916f.1764648076.git.wen.yang@linux.dev>
Date: Tue,  2 Dec 2025 13:51:19 +0800
From: wen.yang@...ux.dev
To: Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>
Cc: Wen Yang <wen.yang@...ux.dev>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	linux-kernel@...r.kernel.org
Subject: [PATCH 2/2] sched/rt: add RT throttle statistics

From: Wen Yang <wen.yang@...ux.dev>

A priority inversion can occur when a CFS task is starved due to
RT throttling. The scenario is as follows:

0. An rtmutex (e.g., softirq_ctrl.lock) is contended by both CFS
   tasks (e.g., ksoftirqd) and RT tasks (e.g., ktimer).
1. An RT task 'A' (e.g., ktimer) acquires the rtmutex.
2. A CFS task 'B' (e.g., ksoftirqd) attempts to acquire the same
   rtmutex and blocks.
3. A higher-priority RT task 'C' (e.g., stress-ng) runs for an
   extended period, preempting task 'A' and causing the RT runqueue
   to be throttled.
4. Once the RT runqueue is throttled, CFS task 'B' should be able
   to run, but it remains blocked because the lock is still held by
   the non-running RT task 'A'. This can even leave the CPU idle.
5. When the RT throttle period ends, the high-priority RT task 'C'
   resumes execution and the cycle repeats, leading to indefinite
   starvation of CFS task 'B' (a user-space analogue is sketched
   below).
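
The same pattern can be illustrated with ordinary pthreads. The
following is only a minimal user-space sketch of the inversion, not
the kernel reproduction itself; thread roles, priorities and timings
are hypothetical. Run as root, pinned to one CPU (taskset -c 0), with
default RT throttling enabled:

	/*
	 * Illustrative user-space analogue of the inversion.
	 * Build: gcc -O2 -o inversion inversion.c -lpthread
	 */
	#include <pthread.h>
	#include <sched.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* non-PI */

	static void set_fifo(int prio)
	{
		struct sched_param sp = { .sched_priority = prio };

		if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp))
			perror("pthread_setschedparam");
	}

	/* 'A': low-priority RT lock holder (plays the role of ktimer). */
	static void *task_a(void *arg)
	{
		set_fifo(10);
		pthread_mutex_lock(&lock);
		usleep(100 * 1000);		/* hold the lock while runnable */
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	/* 'B': CFS waiter (plays the role of ksoftirqd). */
	static void *task_b(void *arg)
	{
		pthread_mutex_lock(&lock);	/* blocks behind 'A' */
		pthread_mutex_unlock(&lock);
		printf("B finally ran\n");
		return NULL;
	}

	/* 'C': high-priority RT spinner (plays the role of stress-ng). */
	static void *task_c(void *arg)
	{
		set_fifo(50);
		for (;;)			/* burn CPU until throttled */
			;
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b, c;

		pthread_create(&a, NULL, task_a, NULL);
		usleep(10 * 1000);		/* let 'A' grab the lock first */
		pthread_create(&b, NULL, task_b, NULL);
		pthread_create(&c, NULL, task_c, NULL);
		pthread_join(b, NULL);		/* 'B' starves here */
		return 0;
	}

While 'C' spins, 'A' is preempted and cannot release the mutex;
during the throttled part of each period 'A' still cannot run, so
'B' stays blocked across periods, matching the ksoftirqd trace below.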

A typical stack trace for the blocked ksoftirqd shows it in a 'D'
(TASK_RTLOCK_WAIT) state, waiting on the lock:
     ksoftirqd/5-61      [005] d...211 58212.064160: sched_switch: prev_comm=ksoftirqd/5 prev_pid=61 prev_prio=120 prev_state=D ==> next_comm=swapper/5 next_pid=0 next_prio=120
     ksoftirqd/5-61      [005] d...211 58212.064161: <stack trace>
 => __schedule
 => schedule_rtlock
 => rtlock_slowlock_locked
 => rt_spin_lock
 => __local_bh_disable_ip
 => run_ksoftirqd
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

Add a throttle_count field to struct rt_rq, incremented on each
throttling event and displayed via print_rt_rq() in
/proc/sched_debug.

User-space tools (e.g. stalld) can then monitor throttle_count to
detect excessive CPU consumption by RT tasks, and locate tasks stuck
in the TASK_RTLOCK_WAIT state to mitigate the priority inversion.
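
As a sketch of such a consumer (the field name matches this patch;
the line layout assumed here follows the existing PU() output in
print_rt_rq(), and the polling interval is arbitrary):

	/* Minimal monitoring sketch: sum throttle_count over all rt_rq's. */
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned long long last = 0;

		for (;;) {
			char line[256];
			unsigned long long v, sum = 0;
			FILE *f = fopen("/proc/sched_debug", "r");

			if (!f)
				return 1;
			while (fgets(line, sizeof(line), f))
				if (sscanf(line, " .throttle_count : %llu", &v) == 1)
					sum += v;	/* one line per CPU's rt_rq */
			fclose(f);
			if (sum > last)
				fprintf(stderr, "RT throttling: +%llu events\n",
					sum - last);
			last = sum;
			sleep(1);
		}
	}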

Signed-off-by: Wen Yang <wen.yang@...ux.dev>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: linux-kernel@...r.kernel.org
---
 kernel/sched/debug.c | 1 +
 kernel/sched/rt.c    | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..8ed33c74e5a5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -894,6 +894,7 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 	P(rt_throttled);
 	PN(rt_time);
 	PN(rt_runtime);
+	PU(throttle_count);
 #endif
 
 #undef PN
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe8e5c5..88c659285c70 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -884,6 +884,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 		 */
 		if (likely(rt_b->rt_runtime)) {
 			rt_rq->rt_throttled = 1;
+			rt_rq->throttle_count++;
 			printk_deferred_once("sched: RT throttling activated\n");
 		} else {
 			/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bbf513b3e76c..88119540e4d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -840,6 +840,7 @@ struct rt_rq {
 	int			rt_throttled;
 	u64			rt_time; /* consumed RT time, goes up in update_curr_rt */
 	u64			rt_runtime; /* allotted RT time, "slice" from rt_bandwidth, RT sharing/balancing */
+	u64			throttle_count;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-- 
2.25.1

