Message-ID: <20250916143044.GL3245006@noisy.programming.kicks-ass.net>
Date: Tue, 16 Sep 2025 16:30:44 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Juri Lelli <juri.lelli@...hat.com>
Cc: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Xuewen Yan <xuewen.yan94@...il.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Suleiman Souhlal <suleiman@...gle.com>,
	Qais Yousef <qyousef@...alina.io>,
	Joel Fernandes <joelagnelf@...dia.com>,
	kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
	kernel-team@...roid.com
Subject: Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck,
 allowing cpu starvation

On Tue, Sep 16, 2025 at 02:52:44PM +0200, Juri Lelli wrote:
> On 16/09/25 13:01, Peter Zijlstra wrote:
> > On Tue, Sep 16, 2025 at 10:51:34AM +0200, Juri Lelli wrote:
> > 
> > > > @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> > > >  
> > > >  		if (!dl_se->server_has_tasks(dl_se)) {
> > > >  			replenish_dl_entity(dl_se);
> > > > -			dl_server_stopped(dl_se);
> > > > +			dl_server_stop(dl_se);
> > > >  			return HRTIMER_NORESTART;
> > > >  		}
> > > 
> > > It looks OK for a quick testing I've done. Also, it seems to make sense
> > > to me. The defer timer has fired (we are executing the callback). If the
> > > server hasn't got tasks to serve we can just stop it (clearing the
> > > flags) and wait for the next enqueue of fair to start it again still in
> > > defer mode. hrtimer_try_to_cancel() is redundant (but harmless),
> > > dequeue_dl_entity() I believe we need to call to deal with
> > > task_non_contending().
> > > 
> > > Peter, what do you think?
> > 
> > Well, the problem was that we were starting/stopping the thing too
> > often, and the general idea of that commit:
> > 
> >   cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> > 
> > was to not stop the server, unless it's not seen fair tasks for a whole
> > period.
> > 
> > Now, the case John trips seems to be that there were tasks, we ran tasks
> > until budget exhausted, dequeued the server and did start_dl_timer().
> > 
> > Then the bandwidth timer fires at a point where there are no more fair
> > tasks, replenish_dl_entity() gets called, which *should* set the
> > 0-laxity timer, but doesn't -- because !server_has_tasks() -- and then
> > nothing.
> > 
> > So perhaps we should do something like the below. Simply continue
> > as normal, until we do a whole cycle without having seen a task.
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 5b64bc621993..269ca2eb5ba9 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  	 */
> >  	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
> >  	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> > -		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> > +		if (!is_dl_boosted(dl_se)) {
> >  
> >  			/*
> >  			 * Set dl_se->dl_defer_armed and dl_throttled variables to
> > @@ -1171,12 +1171,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> >  		if (!dl_se->dl_runtime)
> >  			return HRTIMER_NORESTART;
> >  
> > -		if (!dl_se->server_has_tasks(dl_se)) {
> > -			replenish_dl_entity(dl_se);
> > -			dl_server_stopped(dl_se);
> > -			return HRTIMER_NORESTART;
> > -		}
> > -
> >  		if (dl_se->dl_defer_armed) {
> >  			/*
> >  			 * First check if the server could consume runtime in background.
> > 
> > 
> > Notably, this removes all ->server_has_tasks() users, so if this works
> > and is correct, we can completely remove that callback and simplify
> > more.
> > 
> > Hmm?
> 
> But then what stops the server when the 0-laxity (defer) timer fires
> again a period down the line?

At that point we'll actually run the server, right? And then
__pick_task_dl() will hit the !p case and call dl_server_stopped().

If idle==1 it will actually stop the server, otherwise it will set
idle=1 and around we go.
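
To make the two-pass stop concrete, here is a minimal userspace sketch of that idle handshake. It is a toy model, not the kernel code: struct toy_server and the toy_*() helpers are hypothetical names, and clearing the idle mark whenever a fair task is seen is an assumption taken from the "not seen fair tasks for a whole period" description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the server state; NOT struct sched_dl_entity. */
struct toy_server {
	bool active;	/* server has been started */
	bool idle;	/* the previous pick found no fair task */
};

/* Hypothetical helper: a fair task was seen, so drop the idle mark. */
static void toy_server_saw_task(struct toy_server *s)
{
	s->idle = false;
}

/*
 * Models the "!p" case: the server ran but had nothing to pick.
 * Returns true only when the server is really stopped.
 */
static bool toy_server_stopped(struct toy_server *s)
{
	if (!s->active)
		return false;

	if (s->idle) {
		/* Second empty pass in a row: actually stop. */
		s->active = false;
		s->idle = false;
		return true;
	}

	/* First empty pass: only mark idle and go around again. */
	s->idle = true;
	return false;
}

/* Counts consecutive empty picks until the server stops. */
static int toy_empty_picks_until_stop(struct toy_server *s)
{
	int picks = 0;

	assert(s->active);
	for (;;) {
		picks++;
		if (toy_server_stopped(s))
			return picks;
	}
}
```

Starting from a server that is active and not marked idle, the first empty pick only sets the mark and the second one stops it. Any fair task seen in between resets the count, which is the "whole cycle without having seen a task" behaviour in this model.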
