Message-ID: <3308bca2-624e-42a3-8d98-48751acaa3b3@codethink.co.uk>
Date: Fri, 19 Sep 2025 12:10:34 +0100
From: Ben Dooks <ben.dooks@...ethink.co.uk>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 Matteo Martelli <matteo.martelli@...ethink.co.uk>,
 Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
Subject: BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2

We are testing with stress-ng on a kernel with cgroup-v2 RT group
scheduling enabled (CONFIG_RT_GROUP_SCHED) and are hitting a WARN/BUG
within a minute. It appears to be related to user space calling
sched_setattr(), and possibly other calls.

At the moment we're not sure whether the WARN and BUG calls are entirely
correct. We suspect there may be a race condition that is invalidating
assumptions made in the code.

We are seeing this BUG_ON() in pick_next_rt_entity() being triggered:

	idx = sched_find_first_bit(array->bitmap);
	BUG_ON(idx >= MAX_RT_PRIO);

This suggests that pick_task_rt() ran, thought there was something
to schedule, and called pick_next_rt_entity(), which then found
nothing. pick_task_rt() checks rq->rt.rt_queued before it bothers
to try picking something to run.

(This BUG_ON() triggers if no bit in the array's bitmap indicates that
  there is something to run.)

We added some debug to print, from pick_next_rt_entity(), the current
rt_queued and the value it had when pick_task_rt() looked, and we got:

    idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)

This shows the code was entered with the rt_rq showing that something
was queued, but by the time pick_next_rt_entity() was entered there
seems to be nothing (assuming the bitmap is in sync with the lists...)
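
One possible reading of the idx value of 100: as far as we can tell,
init_rt_rq() sets a delimiter bit at index MAX_RT_PRIO in the bitmap, so
a search over an otherwise empty bitmap returns MAX_RT_PRIO itself. The
following is a user-space sketch of that behaviour only, not kernel
code. All model_* names are hypothetical, and the linear search stands
in for sched_find_first_bit().

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_RT_PRIO	100
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS	((MAX_RT_PRIO + 1 + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical stand-in for the bitmap part of struct rt_prio_array. */
struct model_prio_array {
	unsigned long bitmap[BITMAP_LONGS];
};

void model_set_bit(struct model_prio_array *a, unsigned int nr)
{
	assert(nr <= MAX_RT_PRIO);
	a->bitmap[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

void model_clear_bit(struct model_prio_array *a, unsigned int nr)
{
	assert(nr <= MAX_RT_PRIO);
	a->bitmap[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

/* Empty array: only the delimiter bit at MAX_RT_PRIO is set. */
void model_init(struct model_prio_array *a)
{
	memset(a, 0, sizeof(*a));
	model_set_bit(a, MAX_RT_PRIO);
}

/* Linear search for the lowest set bit (assumed stand-in, see above). */
unsigned int model_find_first_bit(const struct model_prio_array *a)
{
	unsigned int nr;

	for (nr = 0; nr <= MAX_RT_PRIO; nr++) {
		if (a->bitmap[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG)))
			return nr;
	}
	return MAX_RT_PRIO + 1;	/* not reached while the delimiter is set */
}
```

In this model, a search on a freshly initialised array returns
MAX_RT_PRIO, which is the case the BUG_ON() rejects.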

I think the two questions we have are:

- Is the BUG_ON() here appropriate, or would a WARN_ON_ONCE() and
   return NULL be a better way of handling this? I am going to try
   this and see if the system still runs with it.

- Are we seeing a race here, and if so where is the best place to
   prevent it?
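
To make the first question concrete, here is a user-space sketch of the
"warn once and return NULL" shape. It is only a model under stated
assumptions: model_warn_on_once() imitates WARN_ON_ONCE() (report the
first hit, return the condition), the run queue is reduced to one slot
per priority, and every model_* name is hypothetical. Whether the
kernel callers cope with a NULL return is exactly what still needs
testing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_RT_PRIO	100

/* Set once the first warning has been printed. */
bool model_warned;

/* Imitates WARN_ON_ONCE(): print on the first hit, return the condition. */
bool model_warn_on_once(bool cond, const char *what)
{
	if (cond && !model_warned) {
		model_warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

struct model_entity {
	int prio;
};

/* Hypothetical run queue: one slot per priority, NULL when empty. */
struct model_rt_rq {
	struct model_entity *queue[MAX_RT_PRIO];
};

/* Returns the lowest occupied priority, or MAX_RT_PRIO if none. */
unsigned int model_first_queued(const struct model_rt_rq *rq)
{
	unsigned int idx;

	for (idx = 0; idx < MAX_RT_PRIO; idx++) {
		if (rq->queue[idx])
			return idx;
	}
	return MAX_RT_PRIO;
}

/* Warn and return NULL instead of halting when nothing is queued. */
struct model_entity *model_pick_next(struct model_rt_rq *rq)
{
	unsigned int idx;

	assert(rq != NULL);
	idx = model_first_queued(rq);
	if (model_warn_on_once(idx >= MAX_RT_PRIO, "no runnable RT entity"))
		return NULL;

	return rq->queue[idx];
}
```

The caller then has to treat NULL as "nothing to run" rather than
dereferencing the result.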

Note, we do have a few local backported cgroup-v2 patches.

Our systemd unit file to launch the test is here:

[Service]
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps --disable_rlimits --disable_clone_newuser"
Slice=system.slice
OOMPolicy=continue

I added this to dump the bitmap and confirm that at least the bitmap
and the lists were in sync at the point of the bug:

static inline void debug_pick_next(struct rt_rq *rt_rq, int idx,
				   unsigned int qs)
{
	struct rt_prio_array *array = &rt_rq->active;
	unsigned int nr;

	pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was %u)\n",
	       rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs);

	/* dump the priority bitmap one word at a time */
	for (nr = 0; nr < MAX_RT_PRIO; nr += BITS_PER_LONG) {
		pr_info("  bitmap idx %u: %lx\n", nr,
			array->bitmap[nr / BITS_PER_LONG]);
	}

	/* check that the bitmap and the per-priority lists match */
	for (nr = 0; nr < MAX_RT_PRIO; nr += 1) {
		bool l_empty = list_empty(array->queue + nr);
		bool a_empty = !test_bit(nr, array->bitmap);

		if (l_empty != a_empty) {
			/* report the list state and the bitmap state, in that order */
			pr_err(" bitmap idx %u: list %s, bitmap %s\n", nr,
			       l_empty ? "empty" : "full",
			       a_empty ? "empty" : "full");
		}
	}
}
	


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html


