Message-ID: <aNWNn4qj3UYmL0Q_@slm.duckdns.org>
Date: Thu, 25 Sep 2025 08:44:47 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
	mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED

Hello,

On Thu, Sep 25, 2025 at 05:53:23PM +0200, Peter Zijlstra wrote:
> > CPUs can go on- and offline while CPUs are being bypassed. We can handle that
> > in hotplug ops but I'm not sure the complexity is justified in this case.
> 
> Well, not in the current code, since the CPU running this has IRQs and
> preemption disabled (per bypass_lock) and thus stop_machine, as used in
> hotplug can't make progress.
> 
> That is; disabling preemption serializes against hotplug. This is
> something that the scheduler relies on in quite a few places.

Oh, I meant something like:

                                                        CPU X goes down

        scx_bypass(true);

        stuff happening in bypass mode.
        tasks are scheduling, sleeping and              CPU X comes up
        everything.

        scx_bypass(false);

When CPU X comes up, it should come up in bypass mode. That can easily be
done in the online callback, but it's just a bit simpler to keep them
always in sync.

> > This is significantly more expensive. On large systems, the number of
> > threads can easily reach six digits. Iterating all of them while doing
> > locking ops on each of them might become problematic depending on what the
> > rest of the system is doing (unfortunately, it's not too difficult to cause
> > meltdowns on some NUMA systems with cross-node traffic). I don't think
> > p->tasks iterations can be broken up either.
> 
> I thought I understood that bypass isn't something that happens when
> the system is happy. As long as it completes at some point, all this
> should be fine, right?
> 
> I mean, yeah, it'll take a while, but meh.
> 
> Also, we could run the thing at fair or FIFO-1 or something, to be
> outside of ext itself. Possibly we can freeze all the ext tasks on
> return to user to limit the amount of noise they generate.

One problem scenario that we saw with Sapphire Rapids multi-socket machines
is that when there are a lot of cross-socket locking operations (the same
locks getting hammered from two sockets), forward progress slows down to the
point where hard lockups trigger really easily. We saw two problems in such
scenarios - the total throughput of locking operations was low and the
distribution of successes across CPUs was pretty skewed. Combining the two
factors, the slowest CPU on Sapphire Rapids ran about two orders of
magnitude slower than a similarly sized AMD machine doing the same thing.
That benchmark later became part of stress-ng as --flipflop.

Anyways, what this comes down to is that on some machines scx_bypass(true)
has to be pretty careful to avoid these hard-lockup scenarios, as bypass is
exactly what's expected to recover the system when such situations develop.
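
For reference, the kind of whole-system walk being discussed is roughly the
following (just a sketch of the pattern, not the actual patch;
for_each_process_thread() and task_rq_lock() are the existing helpers, and
the per-task body is elided):

	struct task_struct *g, *p;

	read_lock(&tasklist_lock);
	for_each_process_thread(g, p) {
		struct rq_flags rf;
		struct rq *rq = task_rq_lock(p, &rf);

		/*
		 * One rq-lock round trip per task; with six-digit task
		 * counts and cross-node lock contention, this is where
		 * the throughput collapse described above bites.
		 */

		/* ... flip the task in or out of bypass ... */

		task_rq_unlock(rq, p, &rf);
	}
	read_unlock(&tasklist_lock);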

> > The change guard cleanups make sense
> > regardless of how the rest develops. Would it make sense to land them first?
> > Once we know what to do with the core scheduling locking, I'm sure we can
> > find a way to make this work accordingly.
> 
> Yeah, definitely. Thing is, if we can get all sched_change users to be
> the same, that all cleans up better.
> 
> But if cleaning this up gets to be too vexing, we can postpone that.

Yeah, I think it's just going to be a bit more involved and it'd be easier
if we don't make it block other stuff.

Thanks.

-- 
tejun
