linux-kernel - sched: observed instability under stress in 6.12 and mainline

Message-ID: <87254ef1-fa58-4747-b2e1-5c85ecde15bf@windriver.com>
Date: Thu, 4 Sep 2025 10:33:20 -0600
From: Chris Friesen <chris.friesen@...driver.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: osandov@...com, Peter Zijlstra <peterz@...radead.org>
Subject: sched: observed instability under stress in 6.12 and mainline

Hi,

I'd like to draw the attention of the scheduler maintainers to a number 
of kernel bugzilla reports submitted by a colleague a couple of weeks ago:

6.12.18:
https://bugzilla.kernel.org/show_bug.cgi?id=220447
https://bugzilla.kernel.org/show_bug.cgi?id=220448

v6.16-rt3:
https://bugzilla.kernel.org/show_bug.cgi?id=220450
https://bugzilla.kernel.org/show_bug.cgi?id=220449

There seems to be something wrong with either the logic or the locking. 
In one case this resulted in a NULL pointer dereference in 
pick_next_entity().  In another case it resulted in 
BUG_ON(!rq->nr_running) in dequeue_top_rt_rq() and 
SCHED_WARN_ON(!se->on_rq) in update_entity_lag().

My colleague suggests that the NULL pointer dereference may be due to 
pick_eevdf() returning NULL in pick_next_entity().
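
To make that suspected failure mode concrete, here is a minimal userspace
sketch. It is not kernel code: toy_entity, toy_pick() and
toy_pick_next_is_delayed() are hypothetical names invented for illustration,
and the selection rule (earliest deadline among eligible entries) is a
simplifying assumption. It only shows how a picker that can return NULL
forces the caller to check before dereferencing the result.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduling entity; not the kernel's struct. */
struct toy_entity {
	long deadline;
	bool eligible;
	bool delayed;
};

/*
 * Return the eligible entity with the earliest deadline,
 * or NULL when no entity in the queue is eligible.
 */
static struct toy_entity *toy_pick(struct toy_entity *q, size_t n)
{
	struct toy_entity *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!q[i].eligible)
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}

/*
 * Caller that guards against a NULL pick before reading a field.
 * Returns -1 if nothing was picked, 1 if the pick is delayed, 0 otherwise.
 * Without the NULL check, the se->delayed read would be a NULL dereference.
 */
static int toy_pick_next_is_delayed(struct toy_entity *q, size_t n)
{
	struct toy_entity *se = toy_pick(q, n);

	if (!se)
		return -1;
	return se->delayed ? 1 : 0;
}
```

In this toy model, a queue in which every entry is ineligible makes
toy_pick() return NULL, which is the condition under which an unguarded
caller would fault.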

I did some digging and found that commit
https://gitlab.com/linux-kernel/stable/-/commit/86b37810 would not have
been included in 6.12.18, but the equivalent fix should have been in the
6.16 build.

We have not yet identified the root cause.

Any suggestions or assistance would be appreciated.

Thanks,
Chris