Message-ID: <aXuZysNQCVrfFzx7@gpd4>
Date: Thu, 29 Jan 2026 18:32:58 +0100
From: Andrea Righi <arighi@...dia.com>
To: gmonaco@...hat.com
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Tejun Heo <tj@...nel.org>,
Joel Fernandes <joelagnelf@...dia.com>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop
Hi Gabriele,
On Thu, Jan 29, 2026 at 12:48:35PM +0100, gmonaco@...hat.com wrote:
> On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> > Just to make sure we're testing the same thing, I'm currently using
> > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> > branch scx-dl-server.
> >
> > I'm running this test inside virtme-ng:
> > $ vng -vb --config tools/testing/selftests/sched_ext/config
> > $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
>
> Well, that's a fun one, I could reproduce the same failure you
> described in vng on another x86 box.
>
> The arm box (bare metal) I used initially still passes all 4
> iterations of the test just fine.
>
> On the x86 box (vng) I tried different orders of iterations (where the
> original is fair-ext-fair-ext) with and without the ext server active.
>
> No ext-server: the ext iteration fails and also breaks fair (unlike the
> arm64 box, where fair was intact).
> ext-server active: a fair-ext sequence breaks both (as you observed).
>
> I don't have time to look further into this right now, but it looks
> like an interesting pattern.
Thanks for checking and reproducing it.
The fact that these issues around DL server stop/start transitions can
be triggered just by introducing an additional DL server (EXT) makes me
wonder whether this could become even more problematic as we add more DL
servers (hierarchical DL servers?).
Since unconditionally clearing dl_defer_running in dl_server_stop() seems
to re-establish a clean state-machine workflow, I think we should go with
that fix for now, so we can unblock the EXT DL server patch set. With that
change in place, all the server combinations and sequences I've tested
seem to behave consistently.
We can always revisit preserving the short-sleep optimization later if we
find a way to do it with stronger guarantees (and I'll keep investigating
this), but for now the unconditional reset seems like the most robust
fix to me.
Opinions? Peter / Juri?
Thanks,
-Andrea