[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d6abff7f5f9ee5e41f19cb1f9d02de29@codethink.co.uk>
Date: Fri, 19 Sep 2025 18:37:15 +0200
From: Matteo Martelli <matteo.martelli@...ethink.co.uk>
To: Ben Dooks <ben.dooks@...ethink.co.uk>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, Marcel Ziswiler
<marcel.ziswiler@...ethink.co.uk>, Matteo Martelli <matteo.martelli@...ethink.co.uk>
Subject: Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with
crgoup-v2
Hi all,
On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@...ethink.co.uk> wrote:
> We are doing some testing with stress-ng and the cgroup-v2 enabled
> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> related to user-space calling sched_setattr() and possibly other calls.
>
> At the moment we're not sure if the WARN and BUG calls are entirely
> correct, we are considering there may be some sort of race condition
> which is causing incorrect assumptions in the code.
>
> We are seeing this kernel bug in pick_next_rt_entity being triggered
>
> idx = sched_find_first_bit(array->bitmap);
> BUG_ON(idx >= MAX_RT_PRIO);
>
> Which suggests that the pick_task_rt() ran, thought there was something
> there to schedule and got into pick_next_rt_entity() which then found
> there was nothing. It does this by checking rq->rt.rt_queued before it
> bothers to try picking something to run.
>
> (this BUG_ON() is triggered if there is no index in the array indicating
> something there to run)
>
> We added some debug to find out what the values in pick_next_rt_entity()
> with the current rt_queued and the value it was when pick_task_rt()
> looked, and we got:
>
> idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
>
> This shows the code was entered with the rt_q showing something
> should have been queued and by the time the pick_next_rt_entity()
> was entered there seems to be nothing (assuming the array is in
> sync with the lists...)
>
> I think the two questions we have are:
>
> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
> return NULL be the best way of handling this? I am going to try
> this and see if the system is still runnable with this.
>
> - Are we seeing a race here, and if so where is the best place to
> prevent it?
>
> Note, we do have a few local backported cgroup-v2 patches.
>
> Our systemd unit file to launch the test is here:
>
> [Service]
> Type=simple
> Restart=always
> ExecStartPre=/bin/sh -c 'echo 500000 >
> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> ExecStartPre=/bin/sh -c 'echo 500000 >
> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng
> --timeout=0 --verify --oom-avoid --metrics --timestamp
> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose
> --stressor-time
> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps
> --disable_rlimits --disable_clone_newuser"
> Slice=system.slice
> OOMPolicy=continue
>
> I added this to dump the array and confirm at-least the array-v-list
> was in sync at the point of the bug:
>
> static inline void debug_pick_next(struct rt_rq *rt_rq, int idx,
> unsigned qs)
> {
> struct rt_prio_array *array = &rt_rq->active;
> unsigned int nr;
>
> pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was
> %u)\n",
> rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs );
>
> for (nr = 0; nr < MAX_RT_PRIO; nr += sizeof(array->bitmap[0])*8) {
> pr_info(" bitmap idx %u: %lx\n", nr,
> array->bitmap[nr/(sizeof(array->bitmap[0])*8)]);
> }
>
> // check that the bitmap and array match
> for (nr = 0; nr < MAX_RT_PRIO; nr += 1) {
> bool l_empty = list_empty(array->queue + nr);
> bool a_empty = !test_bit(nr, array->bitmap);
>
> if (l_empty != a_empty) {
> pr_err(" bitmap idx %u: array %s, bitmask %s\n", nr,
> a_empty ? "empty" : "full",
> l_empty ? "empty" : "full");
> }
> }
> }
>
Hi all,
To provide some more context, we have found out this issue while running
some tests with stress-ng scheduler stressor[1] and the RT throttling
feature after enabling the RT_GROUP_SCHED kernel option. Note that we
also have PREEMPT_RT enabled in our config.
I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
v6.17-rc6. See below the steps to reproduce it.
cd linux
git reset --hard v6.17-rc6 && git clean -f -d
# Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
# Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
make mrproper
make defconfig
scripts/config -k -e EXPERT
scripts/config -k -e PREEMPT_RT
scripts/config -k -e RT_GROUP_SCHED
make olddefconfig
make -j12
# Download a debian image and run qemu
wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
qemu-system-x86_64 \
-m 2G -smp 4 \
-nographic \
-nic user,hostfwd=tcp::2222-:22 \
-M q35,accel=kvm \
-drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
-virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
-monitor none \
-append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
-kernel arch/x86/boot/bzImage
# Then inside guest machine
# Install stress-ng
apt-get update && apt-get install stress-ng
# Create the stress-ng service. It sets the group RT runtime to 500ms
# (50% BW) via the cgroupv2 interface then it starts the stress-ng
# scheduler stressor. Also note the cpu affinity set to a single CPU
# which seems to help the issue to be more reproducible.
echo "[Unit]
Description=Mixed stress with long in the system slice
After=basic.target
[Service]
AllowedCPUs=0
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
Slice=system.slice
OOMPolicy=continue" > /etc/systemd/system/stress-sched-long-system.service
systemctl start stress-sched-long-system.service
Then the BUG_ON is triggered within a few minutes. See the following logs.
[ 345.657737] ------------[ cut here ]------------
[ 345.657741] kernel BUG at kernel/sched/rt.c:1673!
[ 345.657746] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 345.657749] CPU: 0 UID: 0 PID: 379 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[ 345.657750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[ 345.657751] RIP: 0010:pick_task_rt+0x6c/0x80
[ 345.657762] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7
[ 345.657763] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[ 345.657765] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[ 345.657765] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[ 345.657766] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[ 345.657766] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[ 345.657767] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[ 345.657770] FS: 00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[ 345.657770] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 345.657771] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[ 345.657772] Call Trace:
[ 345.657775] <TASK>
[ 345.657775] __schedule+0x488/0xf30
[ 345.657779] preempt_schedule+0x2e/0x50
[ 345.657780] preempt_schedule_thunk+0x16/0x30
[ 345.657782] migrate_enable+0xbc/0xd0
[ 345.657784] rt_spin_unlock+0xd/0x40
[ 345.657787] get_signal+0x765/0x8d0
[ 345.657789] ? do_nanosleep+0xe9/0x170
[ 345.657791] arch_do_signal_or_restart+0x38/0x250
[ 345.657793] exit_to_user_mode_loop+0x6b/0xb0
[ 345.657796] do_syscall_64+0x221/0x290
[ 345.657798] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 345.657800] RIP: 0033:0x7f3395c4f687
[ 345.657801] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[ 345.657802] RSP: 002b:00007ffd2cbc0270 EFLAGS: 00000202 ORIG_RAX: 00000000000000e6
[ 345.657803] RAX: 0000000000000000 RBX: 00007f339538fb00 RCX: 00007f3395c4f687
[ 345.657803] RDX: 00007ffd2cbc02b0 RSI: 0000000000000000 RDI: 0000000000000000
[ 345.657804] RBP: 000056316fe06320 R08: 0000000000000000 R09: 0000000000000000
[ 345.657804] R10: 00007ffd2cbc02c0 R11: 0000000000000202 R12: 000000000000017b
[ 345.657804] R13: 000056315f837030 R14: 0000000000000003 R15: 0000000000000001
[ 345.657805] </TASK>
[ 345.657805] Modules linked in:
[ 345.657807] ---[ end trace 0000000000000000 ]---
[ 345.657807] RIP: 0010:pick_task_rt+0x6c/0x80
[ 345.657809] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7e c0 90 <0f> 0b 90 0f 0b 90 31 c0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90
[ 345.657809] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[ 345.657810] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[ 345.657810] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[ 345.657810] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[ 345.657811] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[ 345.657811] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[ 345.657814] FS: 00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[ 345.657814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 345.657815] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[ 345.657815] Kernel panic - not syncing: Fatal exception
[ 345.657969] Kernel Offset: 0x14e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 345.685385] ---[ end Kernel panic - not syncing: Fatal exception ]---
Also the WARNING in __dequeue_rt_entity() is often being hit
[ 117.550503] ------------[ cut here ]------------
[ 117.550505] WARNING: CPU: 0 PID: 398 at kernel/sched/rt.c:1366 dequeue_rt_stack+0x311/0x330
[ 117.550518] Modules linked in:
[ 117.550521] CPU: 0 UID: 0 PID: 398 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[ 117.550523] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[ 117.550524] RIP: 0010:dequeue_rt_stack+0x311/0x330
[ 117.550526] Code: 06 00 00 e9 46 fe ff ff 90 0f 0b 90 85 c0 75 06 90 0f 0b 90 31 c0 b9 01 00 00 00 48 85 d2 0f 85 ce fd ff ff e9 cf fd f
[ 117.550526] RSP: 0018:ffffb4af008cbce0 EFLAGS: 00010046
[ 117.550528] RAX: 0000000000000000 RBX: ffff979604108120 RCX: ffff979601febc80
[ 117.550528] RDX: ffff979604108120 RSI: 0000000000000006 RDI: ffff97967dc29400
[ 117.550529] RBP: ffff979604108120 R08: 00000000000e7ef0 R09: ffff979601febc00
[ 117.550529] R10: 0000000000000001 R11: 0000000000000002 R12: ffff97967dc29400
[ 117.550530] R13: 0000000000000006 R14: 0000000000000002 R15: ffffffff97393180
[ 117.550533] FS: 00007f2a872eeb00(0000) GS:ffff9796e5c0f000(0000) knlGS:0000000000000000
[ 117.550534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 117.550534] CR2: 00005633eab27d00 CR3: 00000000041c6000 CR4: 00000000000006f0
[ 117.550535] Call Trace:
[ 117.550538] <TASK>
[ 117.550539] dequeue_rt_entity+0x29/0x160
[ 117.550541] dequeue_task_rt+0x25/0x40
[ 117.550542] rt_mutex_setprio+0x356/0x520
[ 117.550545] rt_mutex_slowunlock+0x15c/0x290
[ 117.550548] ? __set_cpus_allowed_ptr+0x5f/0xa0
[ 117.550549] ? migrate_enable+0x6a/0xd0
[ 117.550550] do_send_sig_info+0x61/0xa0
[ 117.550553] kill_pid_info_type+0x8d/0xa0
[ 117.550555] kill_something_info+0x16b/0x1a0
[ 117.550556] __x64_sys_kill+0x88/0xb0
[ 117.550557] do_syscall_64+0xa4/0x290
[ 117.550560] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 117.550561] RIP: 0033:0x7f2a87b5f007
[ 117.550562] Code: 48 83 c4 08 c3 66 0f 1f 44 00 00 48 8b 15 e9 6d 1a 00 64 89 02 b8 ff ff ff ff eb e4 0f 1f 80 00 00 00 00 b8 3e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 6d 1a 00 f7 d8 64 89 01 48
[ 117.550563] RSP: 002b:00007ffefab861c8 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[ 117.550564] RAX: ffffffffffffffda RBX: 00007f2a872d8a00 RCX: 00007f2a87b5f007
[ 117.550565] RDX: 0000000000000003 RSI: 0000000000000012 RDI: 00000000000001af
[ 117.550565] RBP: 0000000000000002 R08: 000acce4c998f093 R09: 0000000000000000
[ 117.550565] R10: 00007f2a8909d000 R11: 0000000000000202 R12: 0000000000000004
[ 117.550566] R13: 0000000000000001 R14: 00007ffefab86420 R15: 00000000000001af
[ 117.550567] </TASK>
[ 117.550567] ---[ end trace 0000000000000000 ]---
and sometimes RCU stalls
[ 453.738633] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 453.738636] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P853/3:b..l
[ 453.738638] rcu: (detected by 0, t=21002 jiffies, g=1477, q=122 ncpus=4)
[ 453.738640] task:stress-ng-cpu-s state:R running task stack:14200 pid:853 tgid:853 ppid:849 task_flags:0x400140 flags:0x0000
[ 453.738644] Call Trace:
[ 453.738645] <TASK>
[ 453.738646] __schedule+0x3c9/0xf30
[ 453.738651] schedule_rtlock+0x15/0x30
[ 453.738652] rtlock_slowlock_locked+0x1b6/0x1090
[ 453.738654] rt_spin_lock+0x79/0xd0
[ 453.738656] do_send_sig_info+0x31/0xa0
[ 453.738659] kill_pid_info_type+0x8d/0xa0
[ 453.738661] kill_something_info+0x16b/0x1a0
[ 453.738662] __x64_sys_kill+0x88/0xb0
[ 453.738663] do_syscall_64+0xa4/0x290
[ 453.738665] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 453.738667] RIP: 0033:0x7f4df5aeb007
[ 453.738669] RSP: 002b:00007ffffc9d8558 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[ 453.738670] RAX: ffffffffffffffda RBX: 00007f4df5261b20 RCX: 00007f4df5aeb007
[ 453.738671] RDX: 0000000000000012 RSI: 0000000000000012 RDI: 000000000000035e
[ 453.738671] RBP: 0000000000000003 R08: 001053484f787e79 R09: 0000000000000000
[ 453.738672] R10: 00007f4df7029000 R11: 0000000000000202 R12: 0000000000000004
[ 453.738672] R13: 0000000000000001 R14: 00007ffffc9d8738 R15: 000000000000035e
[ 453.738673] </TASK>
[1]: https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu-sched.c
I hope the additional information is helpful.
Best regards,
Matteo Martelli
Powered by blists - more mailing lists