linux-kernel - Re: sched: observed instability under stress in 6.12 and mainline

Message-ID: <b7d9da0c-ee50-173b-3dc9-adcddc64e156@gmail.com>
Date: Mon, 13 Oct 2025 13:54:37 +0800
From: Hao Jia <jiahao.kernel@...il.com>
To: Jiping Ma <jiping.ma2@...driver.com>, kprateek.nayak@....com
Cc: chris.friesen@...driver.com, hanguangjiang@...iang.com,
 linux-kernel@...r.kernel.org, osandov@...com, peterz@...radead.org
Subject: Re: sched: observed instability under stress in 6.12 and mainline



On 2025/10/13 11:03, Jiping Ma wrote:
>>> Hi,
>>>
>>> I'd like to draw the attention of the scheduler maintainers to a number
>>> of kernel bugzilla reports submitted by a colleague a couple of weeks ago:
>>>
>>> 6.12.18:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=220447
>>> https://bugzilla.kernel.org/show_bug.cgi?id=220448
>>>
>>> v6.16-rt3
>>> https://bugzilla.kernel.org/show_bug.cgi?id=220450
>>> https://bugzilla.kernel.org/show_bug.cgi?id=220449
>>>
>>> There seems to be something wrong with either the logic or the locking.
>>> In one case this resulted in a NULL pointer dereference in
>>> pick_next_entity().  In another case it resulted in
>>> BUG_ON(!rq->nr_running) in dequeue_top_rt_rq() and
>>> SCHED_WARN_ON(!se->on_rq) in update_entity_lag().
>>>
>>> My colleague suggests that the NULL pointer dereference may be due to
>>> pick_eevdf() returning NULL in pick_next_entity().
>>>
>>> I did some digging and found that
>>> https://gitlab.com/linux-kernel/stable/-/commit/86b37810 would not have
>>> been included in 6.12.18, but the equivalent fix should have been in the
>>> 6.16 load.
>>>
>>> We haven't yet bottomed out the root cause.
>>>
>>> Any suggestions or assistance would be appreciated.
>>>
>>> Thanks,
>>> Chris
>>>
>>>
>>
>> Maybe this patch can be useful for your problem.
>> https://lore.kernel.org/all/tencent_3177343A3163451463643E434C61911B4208@qq.com/
>>
>> If I understand correctly, we may dequeue_entity twice in
>> rt_mutex_setprio()/__sched_setscheduler(). cfs_bandwidth may break the
>> state of p->on_rq and se->on_rq.
> 
> Thank you very much!
> https://lore.kernel.org/all/tencent_3177343A3163451463643E434C61911B4208@qq.com/ fixes the original panic
> (https://bugzilla.kernel.org/show_bug.cgi?id=220447), but now we hit the other !se->on_rq WARNING.  Do you know
> whether a fix for that already exists?
> 

Perhaps the following patch is more suitable for fixing the previous panic.

https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com/


This issue has been resolved in the latest mainline kernel by the
cfs_bandwidth refactoring.

As Peter mentioned, we need to submit a separate fix patch for the 
stable branch.

https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks-ass.net/

Thanks,
Hao