Message-ID: <Z9hcUSp6P72wT5ig@gpd3>
Date: Mon, 17 Mar 2025 18:30:57 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: Joel Fernandes <joelagnelf@...dia.com>, linux-kernel@...r.kernel.org,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> Hello, Joel.
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> > Consider the case where the previous CPU is cache affine to the waker's CPU
> > and is idle. Currently, scx's default select function only selects the
> > previous CPU in this case if a WF_SYNC request is also made to wake up on
> > the waker's CPU.
> >
> > This means that, without WF_SYNC, a previous CPU that is idle and cache
> > affine to the waker is not considered. This seems extreme. WF_SYNC is not
> > normally passed to the wakeup path outside of some IPC drivers, but it is
> > very possible that the task is cache hot on the previous CPU and shares
> > cache with the waker CPU. Let's avoid too many migrations and select the
> > previous CPU in such cases.
>
> Hmm.. if !WF_SYNC:
>
> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
> core in widening scopes.
>
> 2. If no idle core is found, pick prev_cpu if idle. If not, search for an
> idle CPU in widening scopes.
>
> So, it is considering prev_cpu, right? I think it's preferring idle cores a
> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> there is an idle CPU in this node, at least.

Yeah, we should probably be a bit more conservative by default and avoid
jumping across nodes if there are still idle CPUs within the node.

With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
that while still using the built-in idle policy (since we can specify idle
flags), but that doesn't preclude adjusting the default policy anyway, if
it makes more sense.

I guess the question is: what is more expensive in general on task wakeup?
1) a cross-node migration or 2) running on a partially busy SMT core?
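
To make the two options concrete, here is a rough userspace model in plain C.
It is not kernel or BPF code and does not use any sched_ext API. The topology
(8 CPUs, 2 SMT siblings per core, 2 nodes of 4 CPUs), the helper names and the
two-level "node, then system" widening are all assumptions made for
illustration; the real built-in policy has more scopes. pick_core_first()
follows the !WF_SYNC order quoted above, while pick_node_first() is a
hypothetical variant that exhausts the local node before crossing it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy topology (assumed): SMT siblings are cpu and cpu ^ 1. */
#define NR_CPUS        8
#define CPUS_PER_NODE  4   /* node 0: CPUs 0-3, node 1: CPUs 4-7 */

static int node_of(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/* A core is fully idle only if both SMT siblings are idle. */
static bool core_idle(const bool idle[NR_CPUS], int cpu)
{
	return idle[cpu] && idle[cpu ^ 1];
}

/*
 * Return the first idle CPU (or first CPU of a fully idle core when
 * want_core is set) in @node, or in any node if @node is negative.
 * Returns -1 if nothing matches.
 */
static int find_idle(const bool idle[NR_CPUS], int node, bool want_core)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (node >= 0 && node_of(cpu) != node)
			continue;
		if (want_core ? core_idle(idle, cpu) : idle[cpu])
			return cpu;
	}
	return -1;
}

/* Order described in the thread: idle cores everywhere before idle CPUs. */
static int pick_core_first(const bool idle[NR_CPUS], int prev_cpu)
{
	int cpu, node;

	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	node = node_of(prev_cpu);

	if (core_idle(idle, prev_cpu))
		return prev_cpu;
	if ((cpu = find_idle(idle, node, true)) >= 0)
		return cpu;
	if ((cpu = find_idle(idle, -1, true)) >= 0)
		return cpu;		/* may cross the node boundary */
	if (idle[prev_cpu])
		return prev_cpu;
	if ((cpu = find_idle(idle, node, false)) >= 0)
		return cpu;
	return find_idle(idle, -1, false);
}

/* Hypothetical conservative order: use up the local node first. */
static int pick_node_first(const bool idle[NR_CPUS], int prev_cpu)
{
	int cpu, node;

	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	node = node_of(prev_cpu);

	if (core_idle(idle, prev_cpu))
		return prev_cpu;
	if ((cpu = find_idle(idle, node, true)) >= 0)
		return cpu;
	if (idle[prev_cpu])
		return prev_cpu;	/* partially busy core, same node */
	if ((cpu = find_idle(idle, node, false)) >= 0)
		return cpu;
	if ((cpu = find_idle(idle, -1, true)) >= 0)
		return cpu;
	return find_idle(idle, -1, false);
}
```

In this model, with CPUs 1 and 3 idle but their siblings busy in node 0, and
a fully idle core (CPUs 4 and 5) in node 1, a task with prev_cpu 1 goes to
CPU 4 under pick_core_first() and stays on CPU 1 under pick_node_first().
Which of the two is cheaper on real hardware is exactly the open question.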

-Andrea

[1] https://lore.kernel.org/all/20250314094827.167563-1-arighi@nvidia.com/

>
> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
>
> Thanks.
>
> --
> tejun