linux-kernel - Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

Message-ID: <Z9hcUSp6P72wT5ig@gpd3>
Date: Mon, 17 Mar 2025 18:30:57 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: Joel Fernandes <joelagnelf@...dia.com>, linux-kernel@...r.kernel.org,
	David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine
 without WF_SYNC

On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> Hello, Joel.
> 
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> > Consider the case where the previous CPU is cache affine to the waker's
> > CPU and is idle. Currently, scx's default select function only selects
> > the previous CPU in this case if a WF_SYNC request is also made to wake
> > up on the waker's CPU.
> > 
> > This means that, without WF_SYNC, a previous CPU that is cache affine to
> > the waker and idle is not considered. This seems extreme. WF_SYNC is not
> > normally passed to the wakeup path outside of some IPC drivers, but it is
> > very possible that the task is cache hot on the previous CPU and shares
> > cache with the waker CPU. Let's avoid too many migrations and select the
> > previous CPU in such cases.
> 
> Hmm.. if !WF_SYNC:
> 
> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
>    core in widening scopes.
> 
> 2. If no idle core is found, pick prev_cpu if idle. If not, search for an
>    idle CPU in widening scopes.
> 
> So, it is considering prev_cpu, right? I think it's preferring idle cores a
> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> there is an idle CPU in this node, at least.
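
For illustration only, the !WF_SYNC order described in the two steps above
can be modelled with a small Python sketch. This is a toy model, not the
kernel's code: the `Cpu` record, the eight-CPU topology, the LLC/node/system
scope list and the `select_cpu()` helper name are all assumptions made up
for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    cpu: int   # logical CPU id
    core: int  # physical core id (SMT siblings share it)
    llc: int   # last-level cache domain id
    node: int  # NUMA node id


# Hypothetical machine: 2 nodes, 1 LLC per node, 2 cores per LLC, 2 threads per core.
TOPOLOGY = [Cpu(i, core=i // 2, llc=i // 4, node=i // 4) for i in range(8)]


def core_is_idle(cpus, idle, core):
    """A core counts as idle only if every SMT sibling on it is idle."""
    return all(c.cpu in idle for c in cpus if c.core == core)


def scopes(cpus, prev):
    """Yield candidate lists in widening order: same LLC, same node, whole system."""
    yield [c for c in cpus if c.llc == prev.llc]
    yield [c for c in cpus if c.node == prev.node]
    yield list(cpus)


def select_cpu(cpus, idle, prev_cpu, smt=True):
    """Return the chosen CPU id, or None if nothing is idle (toy convention)."""
    prev = next(c for c in cpus if c.cpu == prev_cpu)

    # Step 1: with SMT, prefer a fully idle core, starting with prev_cpu's core.
    if smt:
        if core_is_idle(cpus, idle, prev.core):
            return prev_cpu
        for scope in scopes(cpus, prev):
            for c in scope:
                if core_is_idle(cpus, idle, c.core):
                    return c.cpu

    # Step 2: no idle core found, so fall back to any idle CPU, prev_cpu first.
    if prev_cpu in idle:
        return prev_cpu
    for scope in scopes(cpus, prev):
        for c in scope:
            if c.cpu in idle:
                return c.cpu
    return None
```

In this model, `select_cpu(TOPOLOGY, {0, 4, 5}, prev_cpu=0)` returns 4:
prev_cpu 0 is idle, but its sibling is busy, so the fully idle core on the
other node wins. That is the cross-node preference being questioned here.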

Yeah, we should probably be a bit more conservative by default and avoid
jumping across nodes if there are still idle CPUs within the node.

With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
that while still using the built-in idle policy (since we can specify idle
flags), but that doesn't preclude adjusting the default policy anyway, if
it makes more sense.
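
As a rough illustration of what a more conservative, node-first ordering
could look like, here is a toy Python model. It is not the
scx_bpf_select_cpu_and() implementation and not the in-kernel default
policy: the `TOPO` layout and the helper names `idle_core_cpu()`,
`idle_cpu()` and `select_node_first()` are hypothetical, chosen only for
this example.

```python
# Hypothetical topology: cpu id -> (core id, NUMA node id).
# 8 CPUs, 2 SMT threads per core, 4 CPUs per node.
TOPO = {cpu: (cpu // 2, cpu // 4) for cpu in range(8)}


def idle_core_cpu(candidates, idle):
    """Return the first candidate whose SMT siblings are all idle, else None."""
    for cpu in candidates:
        core = TOPO[cpu][0]
        if all(s in idle for s in TOPO if TOPO[s][0] == core):
            return cpu
    return None


def idle_cpu(candidates, idle):
    """Return the first idle candidate, else None."""
    for cpu in candidates:
        if cpu in idle:
            return cpu
    return None


def select_node_first(idle, prev_cpu):
    """Exhaust prev_cpu's node (idle core, then idle CPU) before leaving it."""
    node = TOPO[prev_cpu][1]
    # prev_cpu is listed first so it wins ties within its own node.
    local = [prev_cpu] + [c for c in TOPO if TOPO[c][1] == node and c != prev_cpu]
    remote = [c for c in TOPO if TOPO[c][1] != node]

    for candidates in (local, remote):
        cpu = idle_core_cpu(candidates, idle)
        if cpu is None:
            cpu = idle_cpu(candidates, idle)
        if cpu is not None:
            return cpu
    return None
```

With CPU 0 idle, its sibling busy and a fully idle core only on the other
node, `select_node_first({0, 4, 5}, prev_cpu=0)` returns 0, so the task
stays in its node on a partially busy core. Whether that trade-off is the
right default is a separate question.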

I guess the question is: what is more expensive in general on task wakeup?
1) a cross-node migration or 2) running on a partially busy SMT core?

-Andrea

[1] https://lore.kernel.org/all/20250314094827.167563-1-arighi@nvidia.com/

> 
> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
> 
> Thanks.
> 
> -- 
> tejun