linux-kernel - Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2

Message-ID: <d6fcbedf0daf259e2f96a1e0cc666cff@codethink.co.uk>
Date: Wed, 24 Sep 2025 15:10:19 +0200
From: Matteo Martelli <matteo.martelli@...ethink.co.uk>
To: Dietmar Eggemann <dietmar.eggemann@....com>, Ben Dooks
	<ben.dooks@...ethink.co.uk>, Ingo Molnar <mingo@...hat.com>, Peter Zijlstra
	<peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, Marcel Ziswiler
	<marcel.ziswiler@...ethink.co.uk>, Matteo Martelli
        <matteo.martelli@...ethink.co.uk>
Subject: Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with
 crgoup-v2

Hi Dietmar,

On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> On 19.09.25 18:37, Matteo Martelli wrote:
> > Hi all,
> > 
> > On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@...ethink.co.uk> wrote:
> >> We are doing some testing with stress-ng and cgroup-v2 enabled
> >> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> >> related to user-space calling sched_setattr() and possibly other calls.
> >>
> >> At the moment we're not sure if the WARN and BUG calls are entirely
> >> correct; we suspect there may be some sort of race condition
> >> which is causing incorrect assumptions in the code.
> >>
> >> We are seeing this kernel bug in pick_next_rt_entity being triggered
> >>
> >> 	idx = sched_find_first_bit(array->bitmap);
> >> 	BUG_ON(idx >= MAX_RT_PRIO);
> >>
> >> This suggests that pick_task_rt() ran, thought there was something
> >> there to schedule and got into pick_next_rt_entity(), which then found
> >> there was nothing. pick_task_rt() checks rq->rt.rt_queued before it
> >> bothers to try picking something to run.
> >>
> >> (this BUG_ON() is triggered if there is no index in the array indicating
> >>   something there to run)
> >>
> >> We added some debug output to pick_next_rt_entity() to print the
> >> current rt_queued and the value it had when pick_task_rt() looked,
> >> and we got:
> >>
> >>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> >>
> >> This shows the code was entered with the rt_q showing something
> >> should have been queued and by the time the pick_next_rt_entity()
> >> was entered there seems to be nothing (assuming the array is in
> >> sync with the lists...)
> >>
> >> I think the two questions we have are:
> >>
> >> - Is the BUG_ON() here appropriate, or would a WARN_ON_ONCE() and
> >>    return NULL be a better way of handling this? I am going to try
> >>    this and see if the system is still runnable with it.
> >>
> >> - Are we seeing a race here, and if so where is the best place to
> >>    prevent it?
> >>
> >> Note, we do have a few local backported cgroup-v2 patches.
> >>
> >> Our systemd unit file to launch the test is here:
> >>
> >> [Service]
> >> Type=simple
> >> Restart=always
> >> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> >> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> >> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
> >> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> >> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> >> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps --disable_rlimits --disable_clone_newuser"
> >> Slice=system.slice
> >> OOMPolicy=continue
> 
> [...]
> 
> > Hi all,
> > 
> > To provide some more context, we found this issue while running
> > some tests with the stress-ng scheduler stressor[1] and the RT throttling
> > feature after enabling the RT_GROUP_SCHED kernel option. Note that we
> > also have PREEMPT_RT enabled in our config.
> > 
> > I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
> > v6.17-rc6. See below the steps to reproduce it.
> > 
> > cd linux
> > git reset --hard v6.17-rc6 && git clean -f -d
> > 
> > # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
> > b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
> 
> Don't get this one ... you just pick a single patch from the RFC
> patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server' ?
> 
> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
> 

Yes, I was looking for a way to set the cpu.rt_runtime_us param for a
specific cgroup from a systemd unit, in order to control the max CPU
bandwidth allowed for a systemd slice. Since systemd deprecated support
for cgroupv1, I picked that patch to export the parameters via cgroupv2.
To my understanding, with that patch, setting the rt_runtime_us and
rt_period_us parameters via cgroupv2 should have the same effect as
setting them via cgroupv1. Of course I could have missed something, and
that could be one reason for the issue. I will look into it more closely
and try to see if the issue is still reproducible with cgroupv1.
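
For reference, the bandwidth arithmetic behind these two parameters can
be sketched as follows: the RT bandwidth of a group is its runtime
divided by its period. This is only an illustration, not part of the
original test setup. The function name rt_bw_percent is hypothetical,
and the 1000000 us period is an assumption (it is the value that makes
a 500000 us runtime correspond to the 50% BW mentioned in the
reproduction steps).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the RT bandwidth of a
# group as an integer percentage, given runtime and period in microseconds.
rt_bw_percent() {
    runtime_us=$1
    period_us=$2
    # Integer arithmetic; the result is truncated toward zero.
    echo $(( $runtime_us * 100 / $period_us ))
}

# 500000 us of runtime in an assumed 1000000 us period
rt_bw_percent 500000 1000000
```

On the guest, the runtime value could be read back from
/sys/fs/cgroup/system.slice/cpu.rt_runtime_us; whether a matching
cpu.rt_period_us file is present depends on the applied patch, so treat
that path as an assumption.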

> 
> > # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
> > make mrproper
> > make defconfig
> > scripts/config -k -e EXPERT
> > scripts/config -k -e PREEMPT_RT
> > scripts/config -k -e RT_GROUP_SCHED
> > make olddefconfig
> > make -j12
> > 
> > # Download a debian image and run qemu
> > wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
> > qemu-system-x86_64 \
> >     -m 2G -smp 4 \
> >     -nographic \
> >     -nic user,hostfwd=tcp::2222-:22 \
> >     -M q35,accel=kvm \
> >     -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
> >     -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
> >     -monitor none \
> >     -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
> >     -kernel arch/x86/boot/bzImage
> > 
> > # Then inside guest machine
> > # Install stress-ng
> > apt-get update && apt-get install stress-ng
> > 
> > # Create the stress-ng service. It sets the group RT runtime to 500ms
> > # (50% BW) via the cgroupv2 interface, then it starts the stress-ng
> > # scheduler stressor. Also note the CPU affinity set to a single CPU,
> > # which seems to make the issue more reproducible.
> 
> I assume this is the 'AllowedCPUs=0' line in the systemd service file.

Yes, correct.

> 
> > echo "[Unit]
> > Description=Mixed stress with long in the system slice
> > After=basic.target
> > 
> > [Service]
> > AllowedCPUs=0
> > Type=simple
> > Restart=always
> > ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> > ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
> 
> 
> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
> stress-ng related tasks do you have running in
> 'system.slice/stress-sched-long-system.service'? And all of them on CPU0?

Yes, with --cpu-sched 0, stress-ng is using 4 scheduler stressors, all
running on CPU 0. To my understanding each scheduler stressor forks 16
stress-ng child tasks [1]; this is confirmed by the number of stress-ng
tasks running on the system. The test itself is not particularly
meaningful; it just reflects the setup I had when I found the BUG_ON.
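
One possible way to check the task count and CPU placement described
above is sketched below. It is a rough illustration, not the commands
that were actually used: both helper names are hypothetical, and it
assumes the procps 'ps' with the 'psr' (last-used CPU) output field.

```shell
#!/bin/sh
# Hypothetical helper: count the PIDs listed in a cgroup.procs-style file
# (one PID per line).
count_cgroup_procs() {
    # grep -c prints the number of lines that contain at least one digit
    grep -c '[0-9]' "$1"
}

# Hypothetical helper: print "<pid> <cpu>" for every PID in the file that
# is still alive, where <cpu> is the CPU the task last ran on.
cpu_of_procs() {
    while read -r pid; do
        cpu=$(ps -o psr= -p "$pid" 2>/dev/null | tr -d ' ')
        if [ -n "$cpu" ]; then
            echo "$pid $cpu"
        fi
    done < "$1"
}
```

On the guest, both helpers would be pointed at
/sys/fs/cgroup/system.slice/stress-sched-long-system.service/cgroup.procs,
for example 'count_cgroup_procs <that file>'. With AllowedCPUs=0, every
line printed by cpu_of_procs is expected to end in 0.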

> [...]
> 

[1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66

Best regards,
Matteo Martelli