Message-ID: <aMq-BKLSIG9JrRb7@jlelli-thinkpadt14gen4.remote.csb>
Date: Wed, 17 Sep 2025 15:56:20 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Xuewen Yan <xuewen.yan94@...il.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Suleiman Souhlal <suleiman@...gle.com>,
Qais Yousef <qyousef@...alina.io>,
Joel Fernandes <joelagnelf@...dia.com>,
kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
kernel-team@...roid.com
Subject: Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck,
allowing cpu starvation
On 17/09/25 14:26, Peter Zijlstra wrote:
> On Wed, Sep 17, 2025 at 11:34:42AM +0200, Peter Zijlstra wrote:
>
> > Yes. This makes sense.
> >
> > The old code would disable the dl_server when the number of fair tasks
> > drops to 0, so even though we had that yield in __pick_task_dl(),
> > we'd never hit it.
> > So the moment another fair task shows up (0->1) we re-enqueue the
> > dl_server (using update_dl_entity() / CBS wakeup rules) and continue
> > consuming bandwidth.
> >
> > However, since we're now not stopping the thing, we hit that yield,
> > getting this pretty terrible behaviour where we will only run fair tasks
> > until there are none and then yield our entire period, forcing another
> > task to wait until the next cycle.
> >
> > Let me go have a play, surely we can do better.
>
> Can you please try:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
>
> That's yesterday's patch and the below. It's compile-tested only, but
> with a bit of luck it'll actually work ;-)
>
> ---
> Subject: sched/deadline: Fix dl_server behaviour
> From: Peter Zijlstra <peterz@...radead.org>
> Date: Wed Sep 17 12:03:20 CEST 2025
>
> John reported undesirable behaviour with the dl_server since commit:
> cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").
>
> When starving fair tasks on purpose (by starting spinning FIFO tasks),
> his fair workload, which often goes (briefly) idle, would have its
> invocations delayed by a whole second; running one invocation per
> second was both unexpected and terribly slow.
>
> The reason this happens is that when dl_se->server_pick_task() returns
> NULL, indicating no runnable tasks, it would yield, pushing any later
> jobs out a whole period (1 second).
>
> Instead simply stop the server. This should restore behaviour in that
> a later wakeup (which restarts the server) will be able to continue
> running (subject to the CBS wakeup rules).
>
> Notably, this does not re-introduce the behaviour cccb45d7c4295 set
> out to solve: any start/stop cycle is naturally throttled by the timer
> period (no active cancel).
Neat!
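
(To double-check I'm reading the changelog right, below is a tiny
userspace toy, nothing to do with the actual diff: it just contrasts
"yield on an empty pick" with "stop on an empty pick" for a fair wakeup
landing mid-period. All names and numbers are made up for illustration.)

/* toy_dl_server.c - crude model of the behaviour change described above.
 * Not kernel code: one dl_server with a 1000ms period; the fair queue went
 * empty earlier in the period and a new fair task wakes at wake_ms.  CBS
 * accounting, the FIFO hog, etc. are abstracted away.
 */
#include <stdio.h>

#define PERIOD_MS 1000

/* Old behaviour: the server yielded when the pick returned NULL, so a
 * later fair wakeup waits for the next period boundary. */
static int served_at_yield(int wake_ms)
{
        (void)wake_ms;
        return PERIOD_MS;
}

/* Fixed behaviour: the server stops; the wakeup restarts it and, bandwidth
 * and CBS wakeup rules permitting, it can continue in the current period. */
static int served_at_stop(int wake_ms)
{
        return wake_ms;         /* idealized, ignoring any deferral */
}

int main(void)
{
        int wakeups[] = { 100, 400, 900 };

        for (unsigned int i = 0; i < sizeof(wakeups) / sizeof(wakeups[0]); i++)
                printf("fair wakeup at %3dms: yield serves at %4dms, stop at %4dms\n",
                       wakeups[i], served_at_yield(wakeups[i]),
                       served_at_stop(wakeups[i]));

        return 0;
}
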
> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> Reported-by: John Stultz <jstultz@...gle.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
...
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -371,10 +371,39 @@ extern s64 dl_scaled_delta_exec(struct r
> * dl_server_update() -- called from update_curr_common(), propagates runtime
> * to the server.
> *
> - * dl_server_start()
> - * dl_server_stop() -- start/stop the server when it has (no) tasks.
> + * dl_server_start() -- start the server when it has tasks; it will stop
> + * automatically when there are no more tasks, per
> + * dl_se::server_pick() returning NULL.
> + *
> + * dl_server_stop() -- (force) stop the server; use when updating
> + * parameters.
> *
> * dl_server_init() -- initializes the server.
> + *
> + * When started the dl_server will (per dl_defer) schedule a timer for its
> + * zero-laxity point -- that is, unlike regular EDF tasks which run ASAP, a
> + * server will run at the very end of its period.
> + *
> + * This is done such that any runtime from the target class can be accounted
> + * against the server -- through dl_server_update() above -- such that when it
> + * becomes time to run, it might already be out of runtime and get deferred
> + * until the next period. In this case dl_server_timer() will alternate
> + * between defer and replenish but never actually enqueue the server.
> + *
> + * Only when the target class does not manage to exhaust the server's runtime
> + * (there's actually starvation in the given period), will the dl_server get on
> + * the runqueue. Once queued it will pick tasks from the target class and run
> + * them until either its runtime is exhausted, at which point it's back to
> + * dl_server_timer(), or until there are no more tasks to run, at which point
> + * the dl_server stops itself.
> + *
> + * By stopping at this point the dl_server retains bandwidth, which, if a new
> + * task wakes up imminently (starting the server again), can be used --
> + * subject to CBS wakeup rules -- without having to wait for the next period.
In both cases we still defer until either the new period or the current
0-laxity point, right?

The stop cleans all the flags, so a subsequent start calls
enqueue(ENQUEUE_WAKEUP) -> update_dl_entity(), which sets dl_throttled
and dl_defer_armed in both cases, and then we start_dl_timer() (the
defer timer) after it (without enqueueing right away).

Or maybe I am still a bit lost. :)
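
Just to make the question concrete, this is the (very rough, userspace)
model of the flag handling I have in my head -- the field and function
names only mirror my reading of kernel/sched/deadline.c and may well not
match what the patch actually does:

/* toy_defer.c - loose model of the stop/start sequence I'm asking about.
 * Pure userspace toy, not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct toy_dl_se {
        bool dl_throttled;
        bool dl_defer_armed;
};

/* my reading of dl_server_stop(): everything gets cleared */
static void toy_stop(struct toy_dl_se *se)
{
        se->dl_throttled = false;
        se->dl_defer_armed = false;
        printf("stop : throttled=%d defer_armed=%d\n",
               se->dl_throttled, se->dl_defer_armed);
}

/* my reading of dl_server_start() -> enqueue(ENQUEUE_WAKEUP) ->
 * update_dl_entity(): for a deferred server we (re)arm the defer timer
 * instead of enqueueing right away -- hence the question whether we
 * always wait for 0-laxity / the next period, even right after a stop. */
static void toy_start(struct toy_dl_se *se)
{
        se->dl_throttled = true;
        se->dl_defer_armed = true;
        printf("start: throttled=%d defer_armed=%d (no immediate enqueue)\n",
               se->dl_throttled, se->dl_defer_armed);
}

int main(void)
{
        struct toy_dl_se se = { false, false };

        toy_start(&se);         /* first fair task shows up */
        toy_stop(&se);          /* queue goes empty, server stops itself */
        toy_start(&se);         /* next fair wakeup: same deferral again? */

        return 0;
}
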
> + * Additionally, because of the dl_defer behaviour the start/stop behaviour is
> + * naturally throttled to once per period, preventing high context switch
> + * workloads from spamming the hrtimer program/cancel paths.
Right. Also nice cleanup of a flag and a method.