Message-ID: <Z0bNDfGufRZLfAP6@gpd3>
Date: Wed, 27 Nov 2024 08:41:01 +0100
From: Andrea Righi <arighi@...dia.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
On Tue, Nov 26, 2024 at 06:35:16PM -0800, Yury Norov wrote:
> On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote:
> > Using a single global idle cpumask can lead to inefficiencies and put a
> > lot of stress on the cache coherency protocol on large systems with
> > multiple NUMA nodes, since all the CPUs generate intense read/write
> > activity on the same shared cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way, concurrent accesses to the idle cpumasks are confined
> > within each NUMA node.
>
> All this definitely complicates the code. Do you have any numbers to
> justify it?
I'll include some numbers in v2. For now I just wanted to send the patch
out to start some discussion. :)
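
To make the discussion a bit more concrete in the meantime, the core idea
boils down to something like the following (simplified sketch; allocation,
initialization and hotplug handling omitted, and the struct/field names may
differ from what ends up in the patch):

struct idle_cpumask {
        cpumask_var_t cpu;      /* idle CPUs in the node */
        cpumask_var_t smt;      /* fully idle SMT cores in the node */
};

/* One entry per NUMA node, indexed by node id */
static struct idle_cpumask **idle_masks;

static struct cpumask *get_idle_cpumask(int cpu)
{
        return idle_masks[cpu_to_node(cpu)]->cpu;
}

static struct cpumask *get_idle_smtmask(int cpu)
{
        return idle_masks[cpu_to_node(cpu)]->smt;
}

Each mask only covers the CPUs of its own node, so CPUs in different nodes
no longer contend on the same cachelines.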
>
> > [Open issue]
> >
> > The scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to return
> > a single cpumask covering all the CPUs, have been changed to report only
> > the cpumask of the NUMA node of the current CPU; this breaks the old
> > behavior, so it can potentially introduce regressions in some scx
> > schedulers.
> >
> > An alternative approach could be to construct a global cpumask
> > on-the-fly, but this could add significant overhead to ops.select_cpu()
> > for schedulers relying on these kfuncs. Additionally, it would be less
> > reliable than accessing the actual cpumasks, as the copy could quickly
> > get out of sync and misrepresent the actual idle state.
> >
> > Probably a better way to solve this issue is to introduce new kfuncs to
> > explicitly select a specific per-NUMA cpumask, and modify the scx
> > schedulers to transition to this new API, for example:
> >
> > const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
> > const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
> >
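
FWIW, on the kernel side these should be pretty much trivial accessors,
roughly like this (rough sketch only, untested; the usual idle-tracking
enablement checks are omitted):

__bpf_kfunc const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
{
        if (node < 0 || node >= nr_node_ids)
                return cpu_none_mask;

        return idle_masks[node]->cpu;
}

with an equivalent scx_bpf_get_idle_numa_smtmask() returning
idle_masks[node]->smt.
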
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
...
> > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> > */
> > if (sched_smt_active()) {
> > const struct cpumask *smt = cpu_smt_mask(cpu);
> > + struct cpumask *idle_smt = get_idle_smtmask(cpu);
> >
> > /*
> > * If offline, @cpu is not its own sibling and
> > * scx_pick_idle_cpu() can get caught in an infinite loop as
> > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > - * is eventually cleared.
> > + * @cpu is never cleared from the idle SMT mask. Ensure that
> > + * @cpu is eventually cleared.
> > */
> > - if (cpumask_intersects(smt, idle_masks.smt))
> > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > - __cpumask_clear_cpu(cpu, idle_masks.smt);
> > + if (cpumask_intersects(smt, idle_smt))
> > + cpumask_andnot(idle_smt, idle_smt, smt);
>
> cpumask_andnot() is a no-op if masks don't intersect.
> cpumask_intersects() is O(N), and you can safely drop it and
> save some cycles.
>
> > + else if (cpumask_test_cpu(cpu, idle_smt))
> > + __cpumask_clear_cpu(cpu, idle_smt);
>
> Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op.
> I feel like you can just clear all those CPUs unconditionally. In
> the worst case, you'll clear them twice, which is harmless. Or am I
> misunderstanding something?
Yeah, makes sense to me; it definitely seems more efficient to just
clear the CPUs unconditionally. Thanks for noticing it.
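
So, something like this (untested), relying on the fact that clearing
already-cleared bits is harmless:

        if (sched_smt_active()) {
                const struct cpumask *smt = cpu_smt_mask(cpu);
                struct cpumask *idle_smt = get_idle_smtmask(cpu);

                /*
                 * Unconditionally clear the whole sibling group and @cpu
                 * itself: both operations are no-ops when the bits are
                 * already clear, and clearing @cpu explicitly still covers
                 * the offline case where @cpu is not in its own SMT mask.
                 */
                cpumask_andnot(idle_smt, idle_smt, smt);
                __cpumask_clear_cpu(cpu, idle_smt);
        }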
>
> > }
> > #endif
> > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
> > }
> >
> > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > {
> > - int cpu;
> > + int start = cpu_to_node(smp_processor_id());
> > + int node, cnt, cpu;
> >
> > retry:
> > if (sched_smt_active()) {
> > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > - if (cpu < nr_cpu_ids)
> > - goto found;
> > + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > + cpu = cpumask_any_and_distribute(idle_masks[node]->smt,
> > + cpus_allowed);
>
> Nit: no need to break the line here.
Ok.
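Will keep it on a single line:

        cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);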
>
> > + if (cpu < nr_cpu_ids)
> > + goto found;
> > + }
> >
> > if (flags & SCX_PICK_IDLE_CORE)
> > return -EBUSY;
> > }
> >
> > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > - if (cpu >= nr_cpu_ids)
> > - return -EBUSY;
> > + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu,
> > + cpus_allowed);
> > + if (cpu < nr_cpu_ids)
> > + goto found;
> > + }
> > + return -EBUSY;
> >
> > found:
> > if (test_and_clear_cpu_idle(cpu))
> > @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * piled up on it even if there is an idle core elsewhere on
> > * the system.
> > */
> > - if (!cpumask_empty(idle_masks.cpu) &&
> > + if (!cpumask_empty(get_idle_cpumask(cpu)) &&
> > !(current->flags & PF_EXITING) &&
> > cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> > if (cpumask_test_cpu(cpu, p->cpus_ptr))
>
> cpumask_empty() is O(N), and the other checks are O(1). If you reorder
> them so that cpumask_empty() is the last check, you increase the
> probability of not having to evaluate it at all.
Ok.
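Will reorder the checks so that the O(1) ones run first and the
cpumask_empty() test is evaluated last:

        if (!(current->flags & PF_EXITING) &&
            cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
            !cpumask_empty(get_idle_cpumask(cpu))) {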
>
> > @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > /*
> > * Keep using @prev_cpu if it's part of a fully idle core.
> > */
> > - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> > + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
> > test_and_clear_cpu_idle(prev_cpu)) {
> > cpu = prev_cpu;
> > goto cpu_found;
> > @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > +
> > + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> > + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask);
>
> cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu)
>
Ok.
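Will do, so the loop becomes:

        for_each_node_state(node, N_POSSIBLE) {
                const struct cpumask *node_mask = cpumask_of_node(node);

                cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
                cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu);
        }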
> > + }
> > }
> >
> > void __scx_update_idle(struct rq *rq, bool idle)
> > @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle)
> > }
> >
> > if (idle)
> > - cpumask_set_cpu(cpu, idle_masks.cpu);
> > + cpumask_set_cpu(cpu, get_idle_cpumask(cpu));
> > else
> > - cpumask_clear_cpu(cpu, idle_masks.cpu);
> > + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu));
>
> assign_cpu(cpu, get_idle_cpumask(cpu), idle)
Ok.
>
> Thanks,
> Yury
Thanks for the review! Will include all these changes in v2.
-Andrea