Message-ID: <aMklFqjbmxMKszkJ@jlelli-thinkpadt14gen4.remote.csb>
Date: Tue, 16 Sep 2025 10:51:34 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Xuewen Yan <xuewen.yan94@...il.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Suleiman Souhlal <suleiman@...gle.com>,
	Qais Yousef <qyousef@...alina.io>,
	Joel Fernandes <joelagnelf@...dia.com>,
	kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
	kernel-team@...roid.com
Subject: Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck,
 allowing cpu starvation

Hi John,

Thanks a lot for looking into this!

On 16/09/25 05:28, John Stultz wrote:
> With 6.17-rc6, I found when running with locktorture enabled, on
> a two core qemu VM, I could easily hit some lockup warnings:
> 
> [   92.301253] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 42s!
> [   92.305170] Showing busy workqueues and worker pools:
> [   92.307434] workqueue events_power_efficient: flags=0x80
> [   92.309796]   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   92.309834]     pending: neigh_managed_work
> [   92.314565]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   92.314604]     pending: crda_timeout_work, neigh_managed_work, neigh_periodic_work, gc_worker
> [   92.321151] workqueue mm_percpu_wq: flags=0x8
> [   92.323124]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   92.323161]     pending: vmstat_update
> [   92.327638] workqueue kblockd: flags=0x18
> [   92.329429]   pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
> [   92.329467]     pending: blk_mq_timeout_work
> [   92.334259] Showing backtraces of running workers in stalled CPU-bound worker pools:
> 
> I bisected it down to commit cccb45d7c429 ("sched/deadline: Less
> agressive dl_server handling"), and in debugging it seems there
> is a chance we end up with the dl_server dequeued while
> dl_se->dl_server_active is still set. This causes dl_server_start()
> to return early without enqueueing the dl_server, so it fails to
> run when RT tasks starve the cpu.
> 
> I found that when this happens, the dl_timer hrtimer is armed and
> fires dl_server_timer(), which hits the
>   `if (!dl_se->server_has_tasks(dl_se))`
> case. That path calls replenish_dl_entity() and
> dl_server_stopped() and finally returns HRTIMER_NORESTART.
> 
> The problem is that dl_server_stopped() only sets
> dl_se->dl_server_idle and returns false (notably without calling
> dl_server_stop(), which would clear dl_server_active).
> 
> After this, we end up in a situation where the timer never fires
> again and nothing enqueues the dl_server entity back onto the
> runqueue, so it never picks tasks from the fair class and we see
> starvation on that core.

Because dl_server_start() returns early if dl_server_active is set.
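
For reference, paraphrasing the two relevant bits of
kernel/sched/deadline.c after cccb45d7c429 (this is from memory, so the
exact guards and surrounding code may differ in your tree):

void dl_server_start(struct sched_dl_entity *dl_se)
{
        if (dl_se->dl_server_active)
                return;         /* bails out on the stale flag, nothing gets enqueued */

        dl_se->dl_server_active = 1;
        enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
        ...
}

static bool dl_server_stopped(struct sched_dl_entity *dl_se)
{
        if (!dl_se->dl_server_active)
                return true;

        if (dl_se->dl_server_idle) {
                dl_server_stop(dl_se);
                return true;
        }

        /* first idle pass: only mark idle, dl_server_active stays set */
        dl_se->dl_server_idle = true;
        return false;
}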
 
> So in dl_server_timer() call dl_server_stop() instead of
> dl_server_stopped(), as that will ensure dl_server_active
> gets cleared when we are dequeued.
> 
> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> Signed-off-by: John Stultz <jstultz@...gle.com>
> ---
> NOTE: I'm not confident this is the right fix, but I wanted
> to share for feedback and testing.
> 
> Also, this resolves the lockup warnings and problematic behavior
> I see with locktorture, but does *not* resolve the behavior
> change I hit with my ksched_football test (which intentionally
> causes RT starvation) that I bisected down to the same
> problematic change and mentioned here:
>   https://lore.kernel.org/lkml/20250722070600.3267819-1-jstultz@google.com/
> This may be just a problem with my test, but I'm still a bit
> wary that this behavior change may bite folks.
> 
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Xuewen Yan <xuewen.yan94@...il.com>
> Cc: K Prateek Nayak <kprateek.nayak@....com>
> Cc: Suleiman Souhlal <suleiman@...gle.com>
> Cc: Qais Yousef <qyousef@...alina.io>
> Cc: Joel Fernandes <joelagnelf@...dia.com>
> Cc: kuyo chang <kuyo.chang@...iatek.com>
> Cc: hupu <hupu.gm@...il.com>
> Cc: kernel-team@...roid.com
> ---
>  kernel/sched/deadline.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f25301267e471..215c3e2cee370 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
>  /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
>  static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>  
> -static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> -
>  static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
>  {
>  	struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  
>  		if (!dl_se->server_has_tasks(dl_se)) {
>  			replenish_dl_entity(dl_se);
> -			dl_server_stopped(dl_se);
> +			dl_server_stop(dl_se);
>  			return HRTIMER_NORESTART;
>  		}

It looks OK for the quick testing I've done, and it also seems to make
sense to me. The defer timer has fired (we are executing the callback);
if the server has no tasks to serve, we can just stop it (clearing the
flags) and wait for the next enqueue of a fair task to start it again,
still in defer mode. hrtimer_try_to_cancel() is redundant (but
harmless), while dequeue_dl_entity() I believe we do need to call in
order to deal with task_non_contending().
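
FWIW, dl_server_stop() is roughly the following (again paraphrased from
memory, so exact field names and ordering may differ):

void dl_server_stop(struct sched_dl_entity *dl_se)
{
        /* the dequeue is the part we need for task_non_contending() */
        dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
        /* redundant when we get here from the timer callback, but harmless */
        hrtimer_try_to_cancel(&dl_se->dl_timer);
        dl_se->dl_defer_armed = 0;
        dl_se->dl_throttled = 0;
        dl_se->dl_server_active = 0;    /* the flag that was otherwise left set */
}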

Peter, what do you think?

Thanks,
Juri

