Message-ID: <aXxeF4g4ME_PoQAO@jlelli-thinkpadt14gen4.remote.csb>
Date: Fri, 30 Jan 2026 08:30:31 +0100
From: Juri Lelli <juri.lelli@...hat.com>
To: Andrea Righi <arighi@...dia.com>
Cc: gmonaco@...hat.com, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Tejun Heo <tj@...nel.org>,
Joel Fernandes <joelagnelf@...dia.com>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Hello,

On 29/01/26 18:32, Andrea Righi wrote:
> Hi Gabriele,
>
> On Thu, Jan 29, 2026 at 12:48:35PM +0100, gmonaco@...hat.com wrote:
> > On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> > > Just to make sure we're testing the same thing, I'm currently using
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> > > branch
> > > scx-dl-server.
> > >
> > > I'm running this test inside virtme-ng:
> > > $ vng -vb --config tools/testing/selftests/sched_ext/config
> > > $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> >
> > Well, that's a fun one, I could reproduce the same failure you
> > described in vng on another x86 box.
> >
> > The arm box (bare metal) I used initially still passes just fine all 4
> > iterations of the test.
> >
> >
> > On the x86 box (vng) I tried different orders of iterations (where the
> > original is fair-ext-fair-ext) with and without the ext server active.
> >
> > No ext-server: the ext iteration fails and breaks also fair (unlike the
> > arm64 box where the fair was intact)
> > ext-server active: a sequence fair-ext breaks both (like you observe).
> >
> > I don't have time to look further into this right now, but it looks
> > like an interesting pattern.
>
> Thanks for checking and reproducing it.
>
> Considering that these issues around DL server stop/start transitions can
> be triggered by introducing an additional DL server (EXT) makes me wonder
> whether this could become even more problematic as we add more DL servers
> (hierarchical DL servers?).
>
> Considering that unconditionally clearing dl_defer_running in
> dl_server_stop() seems to re-establish a clear state-machine workflow,
> I think we should go with that fix for now, so we can unblock the EXT DL
> server patch set. With that change in place, all the server combinations
> and sequences I've tested seem to behave consistently.
>
> We can always revisit preserving the short-sleep optimization later if we
> find a way to do it with stronger guarantees (and I'll keep investigating
> on this), but for now the unconditional reset seems like the most robust
> fix to me.
>
> Opinions? Peter / Juri?

Hummm, however, I now fear that always cleaning on stop would reintroduce
the issue John Stultz reported a while ago where boosted tasks would
need to wait for an entire new period after sleeping briefly. Would it?

Would a hybrid approach be feasible? Can we do "the right thing" (what
Gabriele suggests?) during normal operation and clean up state only on
server unload/load?

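
To make the hybrid idea a bit more concrete, here is a rough userspace toy
model, not kernel code. The struct and all toy_* helpers are hypothetical
names made up for illustration; only dl_defer_running is taken from the
thread. It assumes, as a simplification, that "the right thing" during
normal operation means leaving the flag alone on an ordinary stop, and that
a separate unload/load path exists where the state can be wiped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: the field name echoes the thread, the rest is made up. */
struct toy_dl_server {
	bool active;		/* server is currently started */
	bool dl_defer_running;	/* execution state under discussion */
};

static void toy_server_start(struct toy_dl_server *s)
{
	s->active = true;
}

/* Ordinary stop (e.g. a brief sleep): state is deliberately preserved. */
static void toy_server_stop(struct toy_dl_server *s)
{
	s->active = false;
	/* dl_defer_running left untouched on purpose */
}

/* Hypothetical unload/load hook: the only place the state is reset. */
static void toy_server_unload(struct toy_dl_server *s)
{
	s->active = false;
	s->dl_defer_running = false;
}

/* Small walk-through of the two paths. */
static void toy_demo(void)
{
	struct toy_dl_server s = { .active = true, .dl_defer_running = true };

	toy_server_stop(&s);
	toy_server_start(&s);
	assert(s.dl_defer_running);	/* survived the short stop */

	toy_server_unload(&s);
	assert(!s.dl_defer_running);	/* wiped on unload */
}
```

Whether the real stop path can tell these two cases apart cheaply is exactly
the open question; the sketch only shows the intended split.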
Thanks,
Juri