Message-ID: <Zx31CNqrI4TWgSDI@gpd3>
Date: Sun, 27 Oct 2024 09:08:40 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched_ext: Introduce NUMA awareness to the default
idle selection policy
On Fri, Oct 25, 2024 at 10:02:31AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Oct 25, 2024 at 06:25:35PM +0200, Andrea Righi wrote:
> ...
> > +static DEFINE_STATIC_KEY_FALSE(scx_topology_llc);
> > +static DEFINE_STATIC_KEY_FALSE(scx_topology_numa);
>
> Maybe name them something like scx_selcpu_topo_llc, given that these are
> only used by selcpu?
Ok.
>
> > +static void init_topology(void)
>
> Ditto with naming.
Ok.
>
> > {
> > - struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > - const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
> > + const struct cpumask *cpus;
> > + int nid;
> > + s32 cpu;
> > +
> > + /*
> > + * Detect if the system has multiple NUMA nodes distributed across the
> > + * available CPUs and, in that case, enable NUMA-aware scheduling in
> > + * the default CPU idle selection policy.
> > + */
> > + for_each_node(nid) {
> > + cpus = cpumask_of_node(nid);
> > + if (cpumask_weight(cpus) < nr_cpu_ids) {
>
> Comparing the number of CPUs with nr_cpu_ids doesn't work. The above
> condition can trigger on single-node machines with some CPUs offline or
> unavailable, for example. I think num_node_state(N_CPU) should work or, if
> you want to keep with sched_domains, maybe highest_flag_domain(some_cpu,
> SD_NUMA)->groups->weight would work?
Ok, checking num_possible_cpus() instead of nr_cpu_ids makes more sense.
I was also thinking of refreshing the static keys on hotplug events,
still checking against num_possible_cpus(), so that the topology
optimizations stay (more) consistent even when some CPUs go
offline/online. Old tasks won't update their p->nr_cpus_allowed, I guess,
but in the worst case they may miss some NUMA/LLC optimizations. Maybe we
could add a generation counter and rely on scx_hotplug_seq to handle this
case more precisely (e.g., by updating a local cpumask), but that seems a
bit overkill...
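To illustrate the difference between the two comparisons, here is a
minimal userspace sketch, not kernel code. It assumes a 64-bit integer as
a stand-in for a cpumask, and all the model_* and span_* names are
hypothetical helpers made up for this example. It also assumes that
nr_cpu_ids behaves like "highest possible CPU id + 1", so it can exceed
the number of possible CPUs when the possible mask is sparse.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model (assumption): one bit per CPU id, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Number of set bits, playing the role of cpumask_weight(). */
static unsigned int model_weight(cpumask_model_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Highest set bit + 1, modeling how nr_cpu_ids relates to the possible mask. */
static unsigned int model_nr_cpu_ids(cpumask_model_t possible)
{
	unsigned int ids = 0;

	for (; possible; possible >>= 1)
		ids++;
	return ids;
}

/* Check as in the patch: span has fewer CPUs than nr_cpu_ids. */
static bool span_is_partial_vs_ids(cpumask_model_t span,
				   cpumask_model_t possible)
{
	return model_weight(span) < model_nr_cpu_ids(possible);
}

/* Alternative check: span has fewer CPUs than the number of possible CPUs. */
static bool span_is_partial_vs_possible(cpumask_model_t span,
					cpumask_model_t possible)
{
	return model_weight(span) < model_weight(possible);
}
```

With a sparse possible mask such as 0x0F0F (CPUs 0-3 and 8-11) and a
single span covering all eight CPUs, the first check reports a partial
span, while the second one does not. Neither check in this sketch
accounts for CPUs going offline, which is why refreshing on hotplug is
still relevant.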
About node_state(nid, N_CPU): I've done some tests and it doesn't seem
to work well for this scenario. It correctly returns 0 for memory-only
NUMA nodes, but if, for example, I start a VM with a single NUMA node and
assign all the CPUs to that node, node_state(nid, N_CPU) returns 1
(correctly), whereas in our case the node should be treated like a
memory-only node, since it includes all the possible CPUs.
I've also tried to rely on sd_numa (similar to sd_llc), but it doesn't
seem to work as expected either (this might be a bug? I'll investigate
separately): if I start a VM with 2 NUMA nodes (assigning half of the
CPUs to node 1 and the other half to node 2), sd_numa still reports all
CPUs as assigned to the same node.
Instead, highest_flag_domain(cpu, SD_NUMA)->groups seems to work as
expected, and since the logic is also based on sched_domain like the LLC
one, I definitely prefer this approach, thanks for the suggestions!
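The decision described above (skip memory-only nodes, and treat a node
that spans every possible CPU as not worth a NUMA optimization) can be
sketched as a small userspace model. This is only an illustration under
stated assumptions: cpumasks are modeled as 64-bit integers, the per-node
masks are passed in as a plain array, and model_weight() and
model_enable_numa() are hypothetical names, not kernel functions. The
real code would derive the spans from the sched_domain groups instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model (assumption): one bit per CPU id, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Number of set bits, playing the role of cpumask_weight(). */
static unsigned int model_weight(cpumask_model_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Return true if NUMA-aware idle selection looks useful, i.e. at least
 * one node has CPUs but does not span all the possible CPUs.
 */
static bool model_enable_numa(const cpumask_model_t *node_cpus,
			      size_t nr_nodes, cpumask_model_t possible)
{
	unsigned int total = model_weight(possible);
	size_t nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned int w = model_weight(node_cpus[nid]);

		/* Memory-only node: no CPUs, so nothing to select from. */
		if (!w)
			continue;
		/* Node covers only part of the CPUs: multi-node topology. */
		if (w < total)
			return true;
	}
	/* Every node with CPUs spans all of them: behave as a flat system. */
	return false;
}
```

In this model, a VM with one node holding all 8 CPUs plus a memory-only
node keeps the optimization disabled, while a VM with two nodes of 4 CPUs
each enables it.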
-Andrea
>
> ...
> > + for_each_possible_cpu(cpu) {
> > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > +
> > + if (!sd)
> > + continue;
> > + cpus = sched_domain_span(sd);
> > + if (cpumask_weight(cpus) < nr_cpu_ids) {
>
> Ditto.
>
> ...
> > + /*
> > + * Determine the scheduling domain only if the task is allowed to run
> > + * on all CPUs.
> > + *
> > + * This is done primarily for efficiency, as it avoids the overhead of
> > + * updating a cpumask every time we need to select an idle CPU (which
> > + * can be costly in large SMP systems), but it also aligns logically:
> > + * if a task's scheduling domain is restricted by user-space (through
> > + * CPU affinity), the task will simply use the flat scheduling domain
> > + * defined by user-space.
> > + */
> > + if (p->nr_cpus_allowed == nr_cpu_ids) {
>
> Should compare against num_possible_cpus().
>
> Thanks.
>
> --
> tejun