Message-ID: <aXxeF4g4ME_PoQAO@jlelli-thinkpadt14gen4.remote.csb>
Date: Fri, 30 Jan 2026 08:30:31 +0100
From: Juri Lelli <juri.lelli@...hat.com>
To: Andrea Righi <arighi@...dia.com>
Cc: gmonaco@...hat.com, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>, Tejun Heo <tj@...nel.org>,
	Joel Fernandes <joelagnelf@...dia.com>,
	David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/deadline: Reset dl_server execution state on
 stop

Hello,

On 29/01/26 18:32, Andrea Righi wrote:
> Hi Gabriele,
> 
> On Thu, Jan 29, 2026 at 12:48:35PM +0100, gmonaco@...hat.com wrote:
> > On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> > > Just to make sure we're testing the same thing, I'm currently using
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> > > branch
> > > scx-dl-server.
> > > 
> > > I'm running this test inside virtme-ng:
> > >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> > >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> > 
> > Well, that's a fun one, I could reproduce the same failure you
> > described in vng on another x86 box.
> > 
> > The arm box (bare metal) I used initially still passes just fine all 4
> > iterations of the test.
> > 
> > 
> > On the x86 box (vng) I tried different orders of iterations (where the
> > original is fair-ext-fair-ext) with and without the ext server active.
> > 
> > No ext-server: the ext iteration fails and also breaks fair (unlike the
> > arm64 box, where fair was intact).
> > ext-server active: a fair-ext sequence breaks both (like you observe).
> > 
> > I don't have time to look further into this right now, but it looks
> > like an interesting pattern.
> 
> Thanks for checking and reproducing it.
> 
> The fact that these issues around DL server stop/start transitions can
> be triggered by introducing an additional DL server (EXT) makes me wonder
> whether this could become even more problematic as we add more DL servers
> (hierarchical DL servers?).
> 
> Considering that unconditionally clearing dl_defer_running in
> dl_server_stop() seems to re-establish a clear state-machine workflow,
> I think we should go with that fix for now, so we can unblock the EXT DL
> server patch set. With that change in place, all the server combinations
> and sequences I've tested seem to behave consistently.
> 
> We can always revisit preserving the short-sleep optimization later if we
> find a way to do it with stronger guarantees (and I'll keep investigating
> this), but for now the unconditional reset seems like the most robust
> fix to me.
> 
> Opinions? Peter / Juri?

Hummm, I now however fear that always cleaning on stop would reintroduce
the issue John Stultz reported a while ago where boosted tasks would
need to wait for an entire new period after sleeping briefly. Would it?

Would a hybrid approach be feasible? Can we do "the right thing" (what
Gabriele suggests?) during normal operation and clean up state only on
server unload/load?
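
To make the hybrid idea a bit more concrete, here is a rough user-space
toy model, not kernel code and not a proposed patch. The struct and the
toy_* function names are made up for illustration; only the flag names
loosely mirror dl_defer_armed / dl_defer_running, and the semantics in
the comments are an assumption about how the split could look: a normal
stop keeps defer_running so a brief sleep does not cost a whole new
period, while a separate reset path (intended for server unload/load
only) clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: NOT the kernel's struct sched_dl_entity. */
struct toy_dl_server {
	bool active;        /* server is currently started */
	bool defer_armed;   /* deferral timer is armed */
	bool defer_running; /* assumed: server already past its defer phase */
};

/*
 * Normal stop (e.g. the served class briefly runs out of tasks):
 * drop the armed/active state but preserve defer_running.
 */
static void toy_server_stop(struct toy_dl_server *srv)
{
	assert(srv != NULL);

	if (!srv->active)
		return;

	srv->defer_armed = false;
	srv->active = false;
}

/*
 * Full reset, meant to be called only on server unload/load:
 * stop the server and also forget the execution state.
 */
static void toy_server_reset(struct toy_dl_server *srv)
{
	toy_server_stop(srv);
	srv->defer_running = false;
}
```

Whether the real stop/start paths can be split this cleanly is exactly
the open question; the sketch only shows the intended separation.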

Thanks,
Juri

