Message-ID: <Z10gI2wV1FJ5lyTZ@gpd3>
Date: Sat, 14 Dec 2024 07:05:23 +0100
From: Andrea Righi <arighi@...dia.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
On Mon, Dec 09, 2024 at 06:10:45PM -0800, Yury Norov wrote:
> On Tue, Dec 10, 2024 at 01:14:43AM +0100, Andrea Righi wrote:
> > > And altogether this should look like:
> > >
> > > int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus,
> > > 			       const struct cpumask *cpus_allowed, u64 flags)
> > > {
> > > 	int node, cpu = -EBUSY, random_cpu;
> > >
> > > 	do {
> > > 		/* Pick a 'random' CPU in the hop */
> > > 		random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed);
> > > 		if (random_cpu >= nr_cpu_ids)
> > > 			break;
> > >
> > > 		node = cpu_to_node(random_cpu);
> > >
> > > 		/* Find an idle CPU in the same node */
> > > 		cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> > > 		if (cpu >= 0)
> > > 			break;
> > >
> > > 		/* No luck? Try other nodes */
> > > 	} while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node)));
> > >
> > > 	return cpu;
> > > }
> > >
> > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > > {
> > > 	const struct cpumask *next, *prev = cpu_none_mask;
> > > 	int prev_node = cpu_to_node(prev_cpu);
> > > 	...
> > > 	for_each_numa_hop_mask(next, prev_node) {
> > > 		cpumask_andnot(hop_cpus, next, prev);
> > > 		cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed, flags);
> > > 		if (cpu >= 0)
> > > 			break;
> > > 		prev = next;
> > > 	}
> > > 	...
> > > }
> > >
> > > Not tested, but should work.
> >
> > Makes sense to me, I'll do some testing with this.
>
> One thing you can do to optimize it is to introduce a function that
> pulls the nodes from hop_cpus:
>
> void sched_get_hop_nodes(nodemask_t *hop_nodes, const struct cpumask *hop_cpus)
> {
> 	int cpu, node;
>
> 	for_each_cpu(cpu, hop_cpus) {
> 		node = cpu_to_node(cpu);
> 		node_set(node, *hop_nodes);
> 		/* Skip past the remaining CPUs of this node */
> 		cpu = cpumask_next_zero(cpu, cpumask_of_node(node)) - 1;
> 	}
> }
>
> This should be O(N), but it will let you avoid O(N*M) in the loop
> condition inside scx_pick_idle_cpu_from_hop():
>
> int scx_pick_idle_cpu_from_hop(nodemask_t *hop_nodes,
> 			       const struct cpumask *cpus_allowed, u64 flags)
> {
> 	int node, idle_cpu = -EBUSY, random_cpu;
>
> 	for_each_node_mask(node, *hop_nodes) {
> 		/* Pick a 'random' CPU in the node */
> 		random_cpu = cpumask_any_and_distribute(cpumask_of_node(node), cpus_allowed);
> 		if (random_cpu >= nr_cpu_ids)
> 			continue;
>
> 		/* Find an idle CPU in the same node */
> 		idle_cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> 		if (idle_cpu >= 0)
> 			break;
> 	}
>
> 	return idle_cpu;
> }
>
> And at this point I'd also compare the above with the non-randomized
> version:
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> {
> 	const struct cpumask *next, *prev = cpu_none_mask;
> 	int prev_node = cpu_to_node(prev_cpu);
> 	nodemask_t hop_nodes;
> 	...
> 	for_each_numa_hop_mask(next, prev_node) {
> 		if (!cpumask_and_andnot(hop_cpus, next, cpus_allowed, prev))
> 			goto cont;
>
> 		sched_get_hop_nodes(&hop_nodes, hop_cpus);
> 		for_each_node_mask(node, hop_nodes) {
> 			cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> 			if (cpu >= 0)
> 				goto found;
> 		}
>
> cont:
> 		prev = next;
> 	}
> 	...
> }
>
> I don't know how it performs, but it looks really good.
I've done some tests with multiple variations of this. One of the issues
is that we need to allocate hop_cpus, which seems to introduce some
overhead (or it could be the operations on the cpumasks); either way, it
pretty much nullifies the benefits of the per-node idle cpumasks.

However, I came up with the following, which doesn't seem to introduce
much overhead:
/*
 * Return the node closest to @start, among the nodes in @state that have
 * not been visited yet (tracked in @hop_nodes), or NUMA_NO_NODE when all
 * of them have been visited.
 */
int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
{
	int dist, n, min_node, min_dist;

	if (state >= NR_NODE_STATES)
		return NUMA_NO_NODE;

	min_node = NUMA_NO_NODE;
	min_dist = INT_MAX;

	for_each_node_state(n, state) {
		if (n == start || node_isset(n, *hop_nodes))
			continue;
		dist = node_distance(start, n);
		if (dist < min_dist) {
			min_dist = dist;
			min_node = n;
		}
	}
	if (min_node != NUMA_NO_NODE)
		node_set(min_node, *hop_nodes);

	return min_node;
}

#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state)		\
	for (int __node = __start;						\
	     __node != NUMA_NO_NODE;						\
	     __node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
It's O(N^2) with the number of nodes, but that shouldn't be too bad,
except maybe on MAXSMP systems; even in that case, scx schedulers have
the option to use scx_bpf_pick_idle_cpu_node() instead of the generic
scx_pick_idle_cpu() and define their own criteria to iterate over the
NUMA nodes.
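
For reference, a minimal usage sketch (not from the actual patch, and
assuming the scx_pick_idle_cpu_from_node() helper from the snippets
above) could look like this:

static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
{
	nodemask_t hop_nodes = NODE_MASK_NONE;
	s32 cpu = -EBUSY;

	/*
	 * Visit nodes in order of increasing distance from prev_cpu's node
	 * and stop at the first one that provides an idle CPU.
	 */
	for_each_numa_hop_node(node, cpu_to_node(prev_cpu), hop_nodes, N_ONLINE) {
		cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
		if (cpu >= 0)
			break;
	}

	return cpu;
}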
I'll do more testing; if we don't find any issues, I think I'll go with
this one in the next patch set.
-Andrea