linux-kernel - Re: [PATCH v2] sched_ext: Introduce NUMA awareness to the default idle selection policy

Message-ID: <Zx31CNqrI4TWgSDI@gpd3>
Date: Sun, 27 Oct 2024 09:08:40 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched_ext: Introduce NUMA awareness to the default
 idle selection policy

On Fri, Oct 25, 2024 at 10:02:31AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 25, 2024 at 06:25:35PM +0200, Andrea Righi wrote:
> ...
> > +static DEFINE_STATIC_KEY_FALSE(scx_topology_llc);
> > +static DEFINE_STATIC_KEY_FALSE(scx_topology_numa);
> 
> Maybe name them something like scx_selcpu_topo_llc, given that this is
> only used by selcpu?

Ok.

> 
> > +static void init_topology(void)
> 
> Ditto with naming.

Ok.

> 
> >  {
> > -     struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > -     const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
> > +     const struct cpumask *cpus;
> > +     int nid;
> > +     s32 cpu;
> > +
> > +     /*
> > +      * Detect if the system has multiple NUMA nodes distributed across the
> > +      * available CPUs and, in that case, enable NUMA-aware scheduling in
> > +      * the default CPU idle selection policy.
> > +      */
> > +     for_each_node(nid) {
> > +             cpus = cpumask_of_node(nid);
> > +             if (cpumask_weight(cpus) < nr_cpu_ids) {
> 
> Comparing the number of CPUs with nr_cpu_ids doesn't work. The above
> condition can trigger on single-node machines with some CPUs offline or
> unavailable, for example. I think num_node_state(N_CPU) should work or, if
> you want to stay with sched_domains, maybe highest_flag_domain(some_cpu,
> SD_NUMA)->groups->weight would work?

Ok, checking num_possible_cpus() instead of nr_cpu_ids makes more sense.
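
As a rough illustration of the difference, here is a minimal userspace
sketch (not kernel code). It assumes a plain 64-bit bitmask as a stand-in
for a cpumask; mask_t, mask_weight(), mask_nr_ids() and the two
numa_by_*() predicates are hypothetical names made up for this example.
With a sparse set of possible CPU ids, a single node that owns every
possible CPU still has fewer CPUs than the highest id + 1, so the
nr_cpu_ids-style comparison misfires while the count-based one does not.

```c
#include <stdint.h>

/* Userspace stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

/* Number of CPUs in the mask (what cpumask_weight() computes in the kernel). */
static int mask_weight(mask_t m)
{
	int w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Highest set bit + 1: models an nr_cpu_ids-like value for a possible mask. */
static int mask_nr_ids(mask_t m)
{
	int n = 0;

	for (; m; m >>= 1)
		n++;
	return n;
}

/* Original-style check: node weight against the number of CPU ids. */
static int numa_by_nr_ids(mask_t node, mask_t possible)
{
	return mask_weight(node) < mask_nr_ids(possible);
}

/* Count-based check: node weight against the number of possible CPUs. */
static int numa_by_possible(mask_t node, mask_t possible)
{
	return mask_weight(node) < mask_weight(possible);
}
```

For example, with possible CPUs {0-3, 6, 7} (mask 0xCF) all sitting in a
single node, numa_by_nr_ids(0xCF, 0xCF) is true (6 < 8) while
numa_by_possible(0xCF, 0xCF) is false (6 < 6).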

I was also thinking of refreshing the static keys on hotplug events,
again checking against num_possible_cpus(), so that the topology
optimizations stay (more) consistent even when some CPUs go
offline/online. Old tasks won't update their p->nr_cpus_allowed, I guess,
but in the worst case they just miss some NUMA/LLC optimizations. Maybe we
could add a generation counter and rely on scx_hotplug_seq to handle this
case more precisely (e.g., by updating a local cpumask), but that seems a
bit overkill...

About node_state(nid, N_CPU), I've done some tests and it doesn't seem
to work well for this scenario: it correctly returns 0 for memory-only
NUMA nodes, but if, for example, I start a VM with a single NUMA node and
assign all the CPUs to that node, node_state(nid, N_CPU) returns 1
(correctly). For our purposes, though, that node should be treated like a
memory-only node, since it includes all the possible CPUs.
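
The distinction can be sketched in userspace C (again a model, not kernel
code; mask_t, node_has_cpus() and node_splits_cpus() are hypothetical
names introduced only for illustration). A per-node "has CPUs" test is
true for any node that owns at least one CPU, whereas the property of
interest here is whether the node owns some, but not all, of the possible
CPUs.

```c
#include <stdint.h>

/* Userspace stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

/* Per-node presence check: true as soon as the node owns any CPU. */
static int node_has_cpus(mask_t node_cpus)
{
	return node_cpus != 0;
}

/*
 * Partition check: true only if the node owns at least one CPU but does
 * not cover every possible CPU, i.e. the CPUs are actually split across
 * more than one node.
 */
static int node_splits_cpus(mask_t node_cpus, mask_t possible)
{
	return node_cpus != 0 && (node_cpus & possible) != possible;
}
```

With 8 possible CPUs (mask 0xFF), a single node holding all of them gives
node_has_cpus(0xFF) true but node_splits_cpus(0xFF, 0xFF) false; a
memory-only node (mask 0) is false for both, and a node holding half of
the CPUs (mask 0x0F) is true for both.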

I've also tried relying on sd_numa (similar to sd_llc), but that doesn't
seem to work as expected either (this might be a bug; I'll investigate
separately): if I start a VM with 2 NUMA nodes (assigning half of the
CPUs to node 1 and the other half to node 2), sd_numa still reports all
CPUs as belonging to the same node.

Instead, highest_flag_domain(cpu, SD_NUMA)->groups seems to work as
expected, and since that logic is also based on sched_domain, like the
LLC one, I definitely prefer this approach. Thanks for the suggestions!

-Andrea

> 
> ...
> > +     for_each_possible_cpu(cpu) {
> > +             struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > +
> > +             if (!sd)
> > +                     continue;
> > +             cpus = sched_domain_span(sd);
> > +             if (cpumask_weight(cpus) < nr_cpu_ids) {
> 
> Ditto.
> 
> ...
> > +     /*
> > +      * Determine the scheduling domain only if the task is allowed to run
> > +      * on all CPUs.
> > +      *
> > +      * This is done primarily for efficiency, as it avoids the overhead of
> > +      * updating a cpumask every time we need to select an idle CPU (which
> > +      * can be costly in large SMP systems), but it also aligns logically:
> > +      * if a task's scheduling domain is restricted by user-space (through
> > +      * CPU affinity), the task will simply use the flat scheduling domain
> > +      * defined by user-space.
> > +      */
> > +     if (p->nr_cpus_allowed == nr_cpu_ids) {
> 
> Should compare against num_possible_cpus().
> 
> Thanks.
> 
> --
> tejun