Message-ID: <Z1dVzm0WVZSayz8L@gpd3>
Date: Mon, 9 Dec 2024 21:40:46 +0100
From: Andrea Righi <arighi@...dia.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
Hi Yury,
On Mon, Dec 09, 2024 at 11:32:56AM -0800, Yury Norov wrote:
> On Mon, Dec 09, 2024 at 11:40:55AM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot
> > of stress on the cache coherency protocol on large systems with
> > multiple NUMA nodes, since all the CPUs generate intense read/write
> > activity on that single global cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way, concurrent access to the idle cpumask is restricted to
> > each NUMA node.
> >
> > NOTE: the scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to
> > return a single cpumask covering all the CPUs, have been changed to
> > report only the cpumask of the current NUMA node (based on the current
> > CPU).
> >
> > This breaks the old behavior, but it will be addressed in the next
> > commits, which introduce a new flag to switch between the old single
> > global flat idle cpumask and the multiple per-node cpumasks.
>
> Why don't you change the order of commits so that you first introduce
> the flag and then add the new feature? That way you won't have to
> explain yourself.
Good point! I'll refactor the patch set.
>
> Also, kernel/sched/ext.c is already 7k+ LOCs. Can you move the
> per-node idle masks to a separate file? You can also make this feature
> configurable, and those who don't care (pretty much everyone except
> for PLATINUM 8570 victims, right?) won't even have to compile it.
>
> I'd like to see it enabled only for those who can really benefit from it.
Ok about moving it to a separate file.
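
Just to sketch what the separate file could look like (illustrative
names only, not necessarily what the final code will use): one set of
idle cpumasks per NUMA node, so idle-state updates stay node-local:

  #include <linux/cpumask.h>
  #include <linux/topology.h>

  /*
   * Illustrative sketch, not the actual patch: one set of idle
   * cpumasks per NUMA node, indexed by node id.
   */
  struct idle_masks {
          cpumask_var_t cpu;      /* idle CPUs in this node */
          cpumask_var_t smt;      /* fully-idle cores in this node */
  };

  static struct idle_masks **scx_idle_masks;    /* one entry per node */

  static void update_builtin_idle(int cpu, bool idle)
  {
          int node = cpu_to_node(cpu);

          /* Writes only touch the cpumask of the CPU's own node */
          if (idle)
                  cpumask_set_cpu(cpu, scx_idle_masks[node]->cpu);
          else
                  cpumask_clear_cpu(cpu, scx_idle_masks[node]->cpu);
  }
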
I'm not completely convinced about making it a config option. I think
it'd be nice to allow individual scx schedulers to decide whether to use
the NUMA-aware idle selection or the flat idle selection logic. This can
also pave the way for future enhancements (e.g., introducing generic
sched domains).
Moreover, in terms of overhead, there's not much difference between a
scheduler that doesn't set SCX_OPS_BUILTIN_IDLE_PER_NODE and a single
statically-built idle cpumask: in both cases we still use a single
global cpumask.
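
To give a rough idea of what I mean (hypothetical helper, just to
illustrate the point): when the flag is not set, every lookup can simply
collapse to a single node, which is equivalent to the old global mask:

  /*
   * Hypothetical helper, just to illustrate: without
   * SCX_OPS_BUILTIN_IDLE_PER_NODE everything maps to node 0,
   * i.e. a single global idle cpumask, as before.
   */
  static bool scx_idle_per_node;  /* set if SCX_OPS_BUILTIN_IDLE_PER_NODE */

  static const struct cpumask *get_idle_cpumask(int node)
  {
          if (!scx_idle_per_node)
                  node = 0;       /* flat mode: single global mask */

          return scx_idle_masks[node]->cpu;
  }
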
Thanks,
-Andrea