Message-ID: <CANDhNCpDW-fg6DK8Wcgwq-1fgaYSkxL1G6ChUkC4K6Mpk04aEQ@mail.gmail.com>
Date: Mon, 15 Sep 2025 15:29:10 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, clm@...a.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
On Wed, Jul 2, 2025 at 4:49 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
>
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
>
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
>
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@...a.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
So I know this patch has already had a few issues reported against it:
[1] "sched: DL replenish lagged too much" warnings
[2] changing server parameters breaks per-runqueue running_bw tracking
[3] dl_server_stopped() should return true if dl_se->dl_server_active is 0.
In fact, I reported [4] some trouble with a stress test (ksched_football)
I've been developing alongside the proxy-exec series, which gets stuck
because of the behavior change this brought. I'm open to the idea that my
test is the problem: it tries to generate 5*NR_CPU spinning RT tasks,
which starves kthreadd, and with this change the dl_server only lets one
thread be spawned per second. Still, it seemed concerning, as before this
change the dl_server and rt_throttling would let us generate the RT tasks
much faster (though that may have been unintentional; I haven't fully
gotten my head around the previous behavior, and unfortunately got
distracted with other work).
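For context, the test's spawning loop looks roughly like the sketch below
(hypothetical code, not the actual ksched_football source; spin_fn and the
error handling are simplified). The point is that each kthread_run() call
depends on kthreadd, which runs at SCHED_NORMAL, so once the CPUs are
saturated with FIFO spinners the spawner only makes progress when the
dl_server gives kthreadd some CPU time:

  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/cpumask.h>
  #include <linux/err.h>

  static int spin_fn(void *arg)
  {
          /* Burn CPU until asked to stop */
          while (!kthread_should_stop())
                  cpu_relax();
          return 0;
  }

  static void spawn_spinners(void)
  {
          int i;

          for (i = 0; i < 5 * num_online_cpus(); i++) {
                  /* kthread_run() waits on kthreadd creating the thread */
                  struct task_struct *t = kthread_run(spin_fn, NULL,
                                                      "spinner/%d", i);

                  if (IS_ERR(t))
                          break;
                  /* Make it a FIFO spinner so it starves SCHED_NORMAL tasks */
                  sched_set_fifo(t);
          }
  }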
But while separately testing out Peter's new cleanups for sched_ext
(without my problematic test), I started tripping over workqueue lockup
BUGs in certain situations, and I've found I can reproduce these pretty
easily with vanilla v6.17-rc6 alone (which includes the fixes for the
issues reported above). I don't have CONFIG_SCHED_PROXY_EXEC enabled
for this.
If I run a 2-CPU qemu instance with locktorture enabled, booting with
the command-line args:
"torture.random_shuffle=1 locktorture.writer_fifo=1
locktorture.torture_type=mutex_lock locktorture.nested_locks=8
locktorture.rt_boost=1 locktorture.rt_boost_factor=50
locktorture.stutter=0 "
Within ~7 minutes, I'll usually see something like:
[ 92.301253] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 42s!
[ 92.305170] Showing busy workqueues and worker pools:
[ 92.307434] workqueue events_power_efficient: flags=0x80
[ 92.309796] pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 92.309834] pending: neigh_managed_work
[ 92.314565] pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=4 refcnt=5
[ 92.314604] pending: crda_timeout_work, neigh_managed_work, neigh_periodic_work, gc_worker
[ 92.321151] workqueue mm_percpu_wq: flags=0x8
[ 92.323124] pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 92.323161] pending: vmstat_update
[ 92.327638] workqueue kblockd: flags=0x18
[ 92.329429] pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
[ 92.329467] pending: blk_mq_timeout_work
[ 92.334259] Showing backtraces of running workers in stalled CPU-bound worker pools:
And this will continue every 30 secs with the stuck time increasing:
[ 1411.472533] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 1361s!
[ 1411.476404] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=-20 stuck for 1345s!
[ 1411.480214] Showing busy workqueues and worker pools:
[ 1411.482939] workqueue events_power_efficient: flags=0x80
[ 1411.486171] pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=6 refcnt=7
[ 1411.486220] pending: crda_timeout_work, neigh_managed_work, neigh_periodic_work, gc_worker, reg_check_chans_work, check_lifetime
[ 1411.497091] workqueue mm_percpu_wq: flags=0x8
[ 1411.499829] pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 1411.499878] pending: vmstat_update
[ 1411.506010] workqueue kblockd: flags=0x18
[ 1411.508689] pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
[ 1411.508738] pending: blk_mq_timeout_work
[ 1411.515311] workqueue mld: flags=0x40008
[ 1411.517764] pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 1411.517813] pending: mld_ifc_work
[ 1411.523923] Showing backtraces of running workers in stalled CPU-bound worker pools:
I bisected it down to commit cccb45d7c4295 ("sched/deadline: Less
agressive dl_server handling"), and I found that reverting just the
dequeue_entities() changes (the last two hunks of this patch) resolves
the lockup warnings.
That revert *also* resolves my ksched_football test problem, but I
suspect that's just because that change is the main point of this patch
(getting the costly dl_server_stop() out of the frequently used
dequeue_entities() path).
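For clarity, what my revert re-adds is roughly the old "stop the server
as soon as the last fair task leaves" logic at the end of
dequeue_entities() (paraphrased from memory, variable names approximate,
not the exact upstream hunk):

  /*
   * Pre-cccb45d7c429 behavior: if the rq had queued fair tasks on entry
   * but has none left now, stop the fair dl_server right away instead of
   * letting it idle out after a full period.
   */
  if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
          dl_server_stop(&rq->fair_server);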
Now, while locktorture keeps the CPUs busy, it usually bounces around a
fair bit; but when I see the lockup warnings, a lock_torture_writer is
usually pinned to one CPU and stays that way, whereas with the revert I
don't see things getting stuck running on one CPU. Adding a
trace_printk() to __pick_task_dl() when we pick from a dl_server shows
the dl_server running and picking tasks for a while, but then abruptly
stopping shortly before we see the lockup warnings, suggesting the
dl_server somehow stopped and didn't get started again.
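(The debug hook was just something like this in the dl_server branch of
__pick_task_dl() -- a local hack, shown only to illustrate where I was
looking:)

  if (dl_server(dl_se)) {
          p = dl_se->server_pick_task(dl_se);
          trace_printk("dl_server pick cpu%d: %s/%d\n", cpu_of(rq),
                       p ? p->comm : "(none)", p ? p->pid : -1);
          /* rest of the existing !p handling unchanged */
  }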
From my initial debugging, it seems like dl_se->dl_server_active is set,
but the dl_se isn't enqueued, and because active is set we short-circuit
out of dl_server_start(). I suspect the revert re-adding dl_server_stop()
helps because it forces dl_server_active to be cleared, so we keep the
dl_server_active and enqueued state in sync. Maybe we are dequeueing the
dl_server from update_curr_dl_se() due to dl_runtime_exceeded() but
somehow not re-enqueueing it via dl_server_timer()?
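To spell out the short-circuit I mean, dl_server_start() begins roughly
like this (paraphrasing kernel/sched/deadline.c, only the relevant lines
shown, not the exact code):

  void dl_server_start(struct sched_dl_entity *dl_se)
  {
          if (!dl_server(dl_se) || dl_se->dl_server_active)
                  return;         /* <- we bail here, yet dl_se isn't enqueued */

          dl_se->dl_server_active = 1;
          enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
          /* preemption/resched handling omitted */
  }

So if dl_server_active is ever left set while the entity is dequeued,
nothing will enqueue it again.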
So I am still digging into it, but I do think there's something still
amiss with this change. If there's anything I should try to further
debug this, do let me know.
thanks
-john
[1] https://lore.kernel.org/lkml/CAMuHMdVPR7Q7pvn+QqGcq2pJ00apDgUcaCmAgq6nnM1uHySMcw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com/
[3] https://lore.kernel.org/lkml/20250809130419.1980742-1-chenhuacai@loongson.cn/
[4] https://lore.kernel.org/lkml/CANDhNCo+G4_t8jYU-QNPz42uZsKdMgEmTnr8pYSKbgm26NJUCg@mail.gmail.com/