Message-ID: <7cd74213-5654-aac0-54d0-4f4b1a7f0fef@gmail.com>
Date: Mon, 8 Sep 2025 09:51:15 +0800
From: Hao Jia <jiahao.kernel@...il.com>
To: Chris Friesen <chris.friesen@...driver.com>,
LKML <linux-kernel@...r.kernel.org>, hanguangjiang@...iang.com
Cc: osandov@...com, Peter Zijlstra <peterz@...radead.org>
Subject: Re: sched: observed instability under stress in 6.12 and mainline
On 2025/9/5 00:33, Chris Friesen wrote:
> Hi,
>
> I'd like to draw the attention of the scheduler maintainers to a number
> of kernel bugzilla reports submitted by a colleague a couple of weeks ago:
>
> 6.12.18:
> https://bugzilla.kernel.org/show_bug.cgi?id=220447
> https://bugzilla.kernel.org/show_bug.cgi?id=220448
>
> v6.16-rt3
> https://bugzilla.kernel.org/show_bug.cgi?id=220450
> https://bugzilla.kernel.org/show_bug.cgi?id=220449
>
> There seems to be something wrong with either the logic or the locking.
> In one case this resulted in a NULL pointer dereference in
> pick_next_entity(). In another case it resulted in
> BUG_ON(!rq->nr_running) in dequeue_top_rt_rq() and
> SCHED_WARN_ON(!se->on_rq) in update_entity_lag().
>
> My colleague suggests that the NULL pointer dereference may be due to
> pick_eevdf() returning NULL in pick_next_entity().
>
> I did some digging and found that
> https://gitlab.com/linux-kernel/stable/-/commit/86b37810 would not have
> been included in 6.12.18, but the equivalent fix should have been in the
> 6.16 load.
>
> We haven't yet bottomed out the root cause.
>
> Any suggestions or assistance would be appreciated.
>
> Thanks,
> Chris
>
>
This patch may be relevant to your problem:
https://lore.kernel.org/all/tencent_3177343A3163451463643E434C61911B4208@qq.com/
If I understand correctly, we may call dequeue_entity() twice in
rt_mutex_setprio()/__sched_setscheduler(). CFS bandwidth control
(cfs_bandwidth) may leave p->on_rq and se->on_rq in an inconsistent state.
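
To make the suspected sequence concrete, here is a minimal userspace sketch of how a task-level flag and an entity-level flag can drift apart and lead to a second dequeue attempt. It is a toy model, not kernel code: every toy_* name is hypothetical, the "throttle" step only stands in for whatever path dequeues the entity without updating the task-level flag, and it makes no claim about where the real bug is.

```c
#include <stdbool.h>

/* Toy model only. These are NOT the kernel's structures. */
struct toy_task {
    bool task_on_rq;   /* stands in for p->on_rq */
    bool se_on_rq;     /* stands in for se->on_rq */
};

struct toy_rq {
    int nr_running;    /* stands in for a runqueue task count */
};

/*
 * Dequeue the entity. Returns false if it was already off the queue,
 * which is the situation a kernel assertion on se->on_rq would flag.
 */
static bool toy_dequeue_entity(struct toy_rq *rq, struct toy_task *t)
{
    if (!t->se_on_rq)
        return false;
    t->se_on_rq = false;
    rq->nr_running--;
    return true;
}

/*
 * Hypothetical throttle-like path (assumption): it dequeues the entity
 * but leaves the task-level flag set, so the two flags now disagree.
 */
static void toy_throttle(struct toy_rq *rq, struct toy_task *t)
{
    toy_dequeue_entity(rq, t);
}

/*
 * Hypothetical priority-change path (assumption): it trusts the
 * task-level flag and dequeues before changing scheduling parameters.
 * Returns false if that dequeue turned out to be a second one.
 */
static bool toy_change_prio(struct toy_rq *rq, struct toy_task *t)
{
    bool ok = true;

    if (t->task_on_rq) {
        ok = toy_dequeue_entity(rq, t);
        t->task_on_rq = false;
    }
    return ok;
}

/* Runs the sequence and reports whether a double dequeue was attempted. */
static bool toy_double_dequeue_detected(void)
{
    struct toy_rq rq = { .nr_running = 1 };
    struct toy_task t = { .task_on_rq = true, .se_on_rq = true };

    toy_throttle(&rq, &t);             /* se_on_rq cleared, task_on_rq still set */
    return !toy_change_prio(&rq, &t);  /* second dequeue hits a stale flag */
}
```

In this model toy_double_dequeue_detected() returns true: the priority-change path sees task_on_rq set and tries to dequeue an entity that the earlier step already removed.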
Thanks,
Hao