Message-ID: <CANDhNCqK3VBAxxWMsDez8xkX0vcTStWjRMR95pksUM6Q26Ctyw@mail.gmail.com>
Date: Tue, 16 Sep 2025 20:29:01 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>, LKML <linux-kernel@...r.kernel.org>, 
	Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Suleiman Souhlal <suleiman@...gle.com>, Qais Yousef <qyousef@...alina.io>, 
	Joel Fernandes <joelagnelf@...dia.com>, kuyo chang <kuyo.chang@...iatek.com>, 
	hupu <hupu.gm@...il.com>, kernel-team@...roid.com
Subject: Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck,
 allowing cpu starvation

On Tue, Sep 16, 2025 at 2:30 PM Peter Zijlstra <peterz@...radead.org> wrote:
> On Tue, Sep 16, 2025 at 10:35:46AM -0700, John Stultz wrote:
> > I of course still see the thread spawning issues with my
> > ksched_football test that come from keeping the dl_server running for
> > the whole period, but that's a separate thing I'm trying to isolate.
>
> So what happens with that football thing -- you're saying thread
> spawning issues, so kthread_create() is not making progress?
>
> You're also saying 'keeping the dl_server running for the whole period',
> are you seeing it not being limited to its allocated bandwidth?
>
> Per the bug today, the dl_server does get throttled, not sufficiently?

So yeah, I think this is a totally different issue than the lockup
warning problem, and again, I suspect my test is at least partially
culpable.

You can find my test here:
https://github.com/johnstultz-work/linux-dev/commit/d4c36a62444241558947e22af5a972a0859e031a

The idea is we first start a "Referee" RT prio 20 task, and from that
task we spawn NR_CPU RT prio 5 "Offense players", which just spin
incrementing a value ("the ball"), then NR_CPU RT prio 10 "Defense
players" that just spin (conceptually blocking the Offense players
from running on the cpu). Once everyone is running, the ref zeros the
ball and sleeps for a few seconds. When it wakes, it checks that the
ball is still zero and shuts the test down.
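
For reference, the core of the game looks roughly like this (a
paraphrased sketch with illustrative names; the real code is in the
commit linked above):

/* paraphrased sketch of the player loops */
static unsigned long the_ball;

static int offense_fn(void *unused)
{
	while (!kthread_should_stop())
		WRITE_ONCE(the_ball, READ_ONCE(the_ball) + 1); /* advance ball */
	return 0;
}

static int defense_fn(void *unused)
{
	while (!kthread_should_stop())
		cpu_relax();	/* just burn cpu at a higher RT prio */
	return 0;
}

/* referee, once all the players have checked in: */
	WRITE_ONCE(the_ball, 0);
	msleep(GAME_LENGTH_MS);	/* GAME_LENGTH_MS: illustrative constant */
	WARN(READ_ONCE(the_ball) != 0,
	     "Defense failed to starve the offense!");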

So the point of this test is very much for the defense threads to
starve the offense of the cpu. In doing so, it is helpful for
validating the RT scheduling invariant that we always (heh, well...
ignoring RT_PUSH_IPI, but that's yet another issue) run the top
NR_CPU RT priority tasks across the cpus. So it inherently stresses
rt-throttling as well.

Note: In the test there are also RT prio 15 "crazy-fans" that sleep,
but occasionally wake up and "streak across the field" trying to
disrupt the game, as well as low priority defense players that take
either proxy-mutexes or rt_mutexes that the high priority defense
threads try to acquire, to ensure proxy-exec & priority inheritance
boosting is working - but these can be mostly ignored for this
problem.

Each time the ref spawns a player type, we go from lowest to highest
priority, calling kthread_create(), sched_setattr_nocheck() and then
wake_up_process() for each of the NR_CPU players, and then the Ref
repeatedly sleeps, waiting for all NR_CPU tasks to check in (each
incrementing a player counter).
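
Roughly, the per-player spawn sequence looks like this (again a
sketch, with illustrative names; the exact code is in the linked
commit):

	struct sched_attr attr = {
		.sched_policy   = SCHED_FIFO,
		.sched_priority = prio,	/* 5/10/15/20 depending on role */
	};
	struct task_struct *p;

	p = kthread_create(player_fn, NULL, "player-%d", i);
	if (IS_ERR(p))
		return PTR_ERR(p);
	sched_setattr_nocheck(p, &attr);	/* make it RT before it runs */
	wake_up_process(p);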

The problem is that previously the tasks were spawned without much
delay, but after your change, each RT task after the first NR_CPU
takes ~1 second to spawn. On systems with large cpu counts, taking
NR_CPU * player-types seconds to start up can be long enough (easily
minutes) that we hit hung task errors.

As I've been digging, part of the issue seems to be that kthreadd is
not an RT task, so after we create the first NR_CPU spinner threads,
the subsequent threads don't actually start right away. They have to
wait until rt-throttling or the dl_server runs, allowing kthreadd to
get cpu time and start the thread. But then we have to wait until the
throttling stops so that the RT prio Ref thread can run to set the
new thread as RT prio and spawn the next one, which then won't start
until we throttle again, etc.

So it makes sense that if the dl_server is "refreshed" once a second,
we might see this one-thread-per-second interlock happening. That
much seems likely to be just an issue with my test.

However, prior to your change, it seemed like the RT tasks would be
throttled, kthreadd would run and the new thread would start; then,
with no SCHED_NORMAL tasks left to run, the dl_server would stop and
we'd run the Ref, which would immediately enqueue kthreadd again,
restarting the dl_server. And since the dl_server hadn't done much,
I'm guessing it still had some runtime/bandwidth left, so it would
run kthreadd and then stop right after.

So now my guess (and my apologies, as I'm really not familiar enough
with the DL details) is that since we're no longer stopping the
dl_server, its runtime gets used up each cycle even though it didn't
do much boosting. So we have to wait the full cycle for a refresh.

I've used a slightly modified version of my test (simplifying it a bit
and adding extra trace annotations) to capture the following traces:

New behavior: https://ui.perfetto.dev/#!/?s=663e6a8f4f0ecdbc2bbde3efdc601dc980f7b333
This shows the 1 second delay interlock with kthreadd for each
thread after the first two, creating a clear stair-step pattern.

Old behavior: https://ui.perfetto.dev/#!/?s=3d6a70716c6a66eb8d7056ef2c640838404186c2
This shows how there's a 1 second delay after the first two threads
spawn, but then the rest of the created threads quickly interleave
with kthreadd, allowing them all to start promptly.

I still need to add some temporary trace events around the dl_server
so I can better understand how that interleaving used to work, as it
may just have been luck that my test worked back then.
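
(What I have in mind is nothing fancy, just temporary trace_printk()s
dropped into the dl_server start/stop paths, something like the
following - purely a debugging sketch, exact placement TBD:)

	trace_printk("dl_server start: runtime=%lld deadline=%llu\n",
		     dl_se->runtime, dl_se->deadline);
	...
	trace_printk("dl_server stop: runtime=%lld\n", dl_se->runtime);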

However, interestingly, if we go back to 6.11, before the dl_server
landed, it's curious: I do see the 1 second stair steps, but with
larger cpu counts more of them seem to happen in parallel, so each
second the ref manages to interlock with kthreadd 4 to 8 times, and
that many threads start up. So its behavior isn't exactly like either
the new or old behavior with the dl_server, but it escaped hitting
the hung task errors.

But as the new dl_server logic is much slower for spawning RT
threads, I probably need to find a new way to start up my test. I
just wanted to raise the issue in case others hit similar problems
with the change.

thanks
-john
