Message-ID: <3308bca2-624e-42a3-8d98-48751acaa3b3@codethink.co.uk>
Date: Fri, 19 Sep 2025 12:10:34 +0100
From: Ben Dooks <ben.dooks@...ethink.co.uk>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>,
Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
Subject: BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2
We are doing some testing with stress-ng with cgroup-v2 and RT group
scheduling enabled (CONFIG_RT_GROUP_SCHED), and within a minute we run
into a WARN/BUG related to user-space calling sched_setattr() and
possibly other calls.
At the moment we're not sure the WARN and BUG calls are entirely
correct; we suspect there may be some sort of race condition which is
causing incorrect assumptions in the code.
We are seeing this kernel BUG in pick_next_rt_entity() being triggered:
	idx = sched_find_first_bit(array->bitmap);
	BUG_ON(idx >= MAX_RT_PRIO);
This suggests that pick_task_rt() ran, thought there was something
there to schedule (it checks rq->rt.rt_queued before it bothers to try
picking something to run), and got into pick_next_rt_entity(), which
then found there was nothing.
(This BUG_ON() triggers when no bit in the priority bitmap indicates
anything runnable.)
We added some debug output to pick_next_rt_entity() to print the
current rt_queued alongside the value it had when pick_task_rt()
looked, and got:
	idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
This shows the code was entered with rt_queued indicating something
should have been queued, yet by the time pick_next_rt_entity() ran
there was nothing (assuming the array is in sync with the lists...)
I think the two questions we have are:
- Is the BUG_ON() here appropriate, or would a WARN_ON_ONCE() plus
  returning NULL be a better way of handling this? I am going to try
  this and see if the system is still runnable with it.
- Are we seeing a race here, and if so, where is the best place to
  prevent it?
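For reference, the change I'm planning to try is along these lines (an untested sketch against kernel/sched/rt.c; the exact context lines may differ between versions):

```diff
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
 	idx = sched_find_first_bit(array->bitmap);
-	BUG_ON(idx >= MAX_RT_PRIO);
+	if (WARN_ON_ONCE(idx >= MAX_RT_PRIO))
+		return NULL;
```

The callers of pick_next_rt_entity() would also need to tolerate a NULL return, assuming they don't already.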
Note, we do have a few local backported cgroup-v2 patches.
Our systemd unit file to launch the test is here:
[Service]
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng \
    --timeout=0 --verify --oom-avoid --metrics --timestamp \
    --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose \
    --stressor-time
Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps
--disable_rlimits --disable_clone_newuser"
Slice=system.slice
OOMPolicy=continue
I added this to dump the array and confirm that at least the
array-v-list state was in sync at the point of the bug:
static inline void debug_pick_next(struct rt_rq *rt_rq, int idx,
				   unsigned int qs)
{
	struct rt_prio_array *array = &rt_rq->active;
	unsigned int nr;

	pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was %u)\n",
	       rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs);

	for (nr = 0; nr < MAX_RT_PRIO; nr += BITS_PER_LONG)
		pr_info(" bitmap idx %u: %lx\n", nr,
			array->bitmap[nr / BITS_PER_LONG]);

	/* check that the bitmap and the per-priority lists match */
	for (nr = 0; nr < MAX_RT_PRIO; nr++) {
		bool l_empty = list_empty(array->queue + nr);
		bool a_empty = !test_bit(nr, array->bitmap);

		if (l_empty != a_empty)
			pr_err(" bitmap idx %u: list %s, bitmap %s\n", nr,
			       l_empty ? "empty" : "full",
			       a_empty ? "empty" : "full");
	}
}
--
Ben Dooks http://www.codethink.co.uk/
Senior Engineer Codethink - Providing Genius
https://www.codethink.co.uk/privacy.html