Message-Id: <20250809173945.1953141-1-jackzxcui1989@163.com>
Date: Sun, 10 Aug 2025 01:39:45 +0800
From: Xin Zhao <jackzxcui1989@....com>
To: mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	dietmar.eggemann@....com,
	rostedt@...dmis.org,
	bsegall@...gle.com,
	mgorman@...e.de,
	vschneid@...hat.com,
	bigeasy@...utronix.de,
	clrkwllms@...nel.org
Cc: linux-kernel@...r.kernel.org,
	linux-rt-devel@...ts.linux.dev,
	Xin Zhao <jackzxcui1989@....com>
Subject: [PATCH] sched/fair: Make the BW replenish timer expire in hardirq context for PREEMPT_RT

In 2023, Valentin Schneider posted changes that move the BW replenish timer
callback into hardirq context on PREEMPT_RT. Now that the PREEMPT_RT code has
been merged into mainline, and given the serious impact of this issue, that
change should be incorporated into mainline as well, so that other PREEMPT_RT
users do not run into the same problem.
Our project hit the same issue on a Linux 6.1.134 PREEMPT_RT kernel, and I
have written a reproducer that triggers the resulting panic almost 100% of
the time on our system.
The reproducer is inspired by earlier discussions of this cgroup issue, which
contained the following description:
    Consider the following scenario under PREEMPT_RT:
    o A CFS task p0 gets throttled while holding read_lock(&lock)
    o A task p1 blocks on write_lock(&lock), making further readers enter the
      slowpath
    o A ktimers or ksoftirqd task blocks on read_lock(&lock)
    If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued on
    the same CPU as one where ktimers/ksoftirqd is blocked on read_lock(&lock),
    this creates a circular dependency.
My reproducer follows the logic described above. It runs on a Linux 6.1.134
PREEMPT_RT system with only six CPUs.
I wrote a kernel module named testcgroupbug.ko, which creates a proc node
that responds to an ioctl command. Depending on the mode parameter passed, it
performs one of two operations: read mode or write mode.
In read mode, it takes read_lock(), waits in a loop until a write-mode task
exists, busy-loops for 0.25 seconds, arms a pinned hrtimer in
HRTIMER_MODE_ABS_PINNED mode, and finally calls read_unlock() once the
specified timeout expires.
In write mode, it takes write_lock(), waits for the specified timeout to
expire, and then calls write_unlock().
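The testcgroupbug.ko module itself is not part of this patch, so the
following is only a minimal sketch of what its ioctl path could look like on
the 6.1 kernel mentioned above. The proc node name, the TESTCG_READ and
TESTCG_WRITE ioctl numbers, the 1-second hrtimer expiry and all helper names
are assumptions for illustration, not the actual reproducer code:

/*
 * Hypothetical sketch of testcgroupbug.ko (not part of this patch),
 * using the 6.1-era hrtimer_init() API. All names below are assumptions.
 */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/hrtimer.h>
#include <linux/jiffies.h>
#include <linux/delay.h>

#define TESTCG_READ	_IO('t', 1)	/* hypothetical ioctl numbers */
#define TESTCG_WRITE	_IO('t', 2)

static DEFINE_RWLOCK(test_lock);
static atomic_t writer_waiting = ATOMIC_INIT(0);
static struct hrtimer pinned_timer;

static enum hrtimer_restart pinned_timer_fn(struct hrtimer *t)
{
	return HRTIMER_NORESTART;
}

/* Read mode: hold read_lock() like the throttled CFS task p0. */
static void do_read_mode(unsigned long timeout_secs)
{
	unsigned long end;

	read_lock(&test_lock);

	/* Wait until a write-mode task exists and blocks on write_lock(). */
	while (!atomic_read(&writer_waiting))
		cpu_relax();

	/* Burn ~0.25s of CPU so the cgroup quota gets exhausted. */
	end = jiffies + HZ / 4;
	while (time_before(jiffies, end))
		cpu_relax();

	/* Arm a soft, CPU-pinned hrtimer; 1s expiry chosen arbitrarily. */
	hrtimer_init(&pinned_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
	pinned_timer.function = pinned_timer_fn;
	hrtimer_start(&pinned_timer, ktime_add_ns(ktime_get(), NSEC_PER_SEC),
		      HRTIMER_MODE_ABS_PINNED);

	/* Sleeping under read_lock() is fine on PREEMPT_RT (sleeping rwlock). */
	msleep(timeout_secs * 1000);
	read_unlock(&test_lock);
}

/* Write mode: block in write_lock(), pushing new readers into the slowpath. */
static void do_write_mode(unsigned long timeout_secs)
{
	atomic_set(&writer_waiting, 1);
	write_lock(&test_lock);
	msleep(timeout_secs * 1000);
	write_unlock(&test_lock);
	atomic_set(&writer_waiting, 0);
}

static long testcg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	switch (cmd) {
	case TESTCG_READ:
		do_read_mode(arg);
		return 0;
	case TESTCG_WRITE:
		do_write_mode(arg);
		return 0;
	default:
		return -EINVAL;
	}
}

static const struct proc_ops testcg_proc_ops = {
	.proc_ioctl	= testcg_ioctl,
};

static int __init testcg_init(void)
{
	return proc_create("testcgroupbug", 0666, NULL, &testcg_proc_ops) ? 0 : -ENOMEM;
}

static void __exit testcg_exit(void)
{
	remove_proc_entry("testcgroupbug", NULL);
}

module_init(testcg_init);
module_exit(testcg_exit);
MODULE_LICENSE("GPL");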
The script that drives the reproducer is structured as follows. The main idea
is to keep every CPU except CPU 1 busy, so that the system picks CPU 1 for
the cgroup period timer. When the read-mode task then enters the kernel via
ioctl, its soft-mode pinned hrtimer is also created on CPU 1, forming the
circular dependency described above:
taskset -c 0 ./deadloop &
taskset -c 2 ./deadloop &
taskset -c 3 ./deadloop &
taskset -c 4 ./deadloop &
taskset -c 5 ./deadloop &
sleep 3
mkdir /sys/fs/cgroup/test
echo "500000 1000000" > /sys/fs/cgroup/test/cpu.max
sleep 1
taskset -c 1 ./rwlock_read 5 &
pid=$!
echo $pid > /sys/fs/cgroup/test/cgroup.procs
chrt -f 60 ./rwlock_write 3 &
In the script above, rwlock_read invokes the module's read-mode path, with 5
being the timeout in seconds; rwlock_write invokes the module's write-mode
path, with 3 being the timeout in seconds.
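The deadloop, rwlock_read and rwlock_write binaries are likewise not
included here. A minimal sketch of the user-space helper, reusing the
hypothetical proc node and ioctl numbers from the module sketch above, could
look like this (built twice, or selecting the mode from the program name):

/* Hypothetical user-space side of the reproducer (rwlock_read/rwlock_write). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define TESTCG_READ	_IO('t', 1)	/* must match the module's definitions */
#define TESTCG_WRITE	_IO('t', 2)

int main(int argc, char **argv)
{
	unsigned long timeout_secs;
	unsigned int cmd;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <timeout-seconds>\n", argv[0]);
		return 1;
	}
	timeout_secs = strtoul(argv[1], NULL, 10);

	/* Pick read or write mode from the program name. */
	cmd = strstr(argv[0], "write") ? TESTCG_WRITE : TESTCG_READ;

	fd = open("/proc/testcgroupbug", O_RDWR);
	if (fd < 0) {
		perror("open /proc/testcgroupbug");
		return 1;
	}

	/* The call only returns after the module releases the lock. */
	if (ioctl(fd, cmd, timeout_secs) < 0)
		perror("ioctl");

	close(fd);
	return 0;
}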

Signed-off-by: Xin Zhao <jackzxcui1989@....com>
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a0593..54c998661 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,8 +6456,13 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *paren
 	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+#ifdef CONFIG_PREEMPT_RT
+	hrtimer_setup(&cfs_b->period_timer, sched_cfs_period_timer, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_ABS_PINNED_HARD);
+#else
 	hrtimer_setup(&cfs_b->period_timer, sched_cfs_period_timer, CLOCK_MONOTONIC,
 		      HRTIMER_MODE_ABS_PINNED);
+#endif
 
 	/* Add a random offset so that timers interleave */
 	hrtimer_set_expires(&cfs_b->period_timer,
@@ -6483,7 +6488,11 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+#ifdef CONFIG_PREEMPT_RT
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED_HARD);
+#else
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+#endif
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-- 
2.34.1

