Message-ID: <Z0bNDfGufRZLfAP6@gpd3>
Date: Wed, 27 Nov 2024 08:41:01 +0100
From: Andrea Righi <arighi@...dia.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
On Tue, Nov 26, 2024 at 06:35:16PM -0800, Yury Norov wrote:
> On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote:
> > Using a single global idle cpumask can lead to inefficiencies and put a
> > lot of stress on the cache coherency protocol on large systems with
> > multiple NUMA nodes, since all the CPUs generate intense read/write
> > activity on the same shared cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way, concurrent accesses to the idle cpumasks are confined
> > within each NUMA node.
>
> All this definitely complicates the code. Do you have any numbers to
> justify it?
I'll include some numbers in v2. For now I just wanted to send the patch
out to start some discussion. :)
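
To make the discussion a bit more concrete in the meantime, the core idea
boils down to something like the following (simplified sketch; allocation,
initialization and hotplug handling omitted, and the struct/field names may
differ from what ends up in the patch):

struct idle_cpumask {
        cpumask_var_t cpu;      /* idle CPUs in the node */
        cpumask_var_t smt;      /* fully idle SMT cores in the node */
};

/* One entry per NUMA node, indexed by node id */
static struct idle_cpumask **idle_masks;

static struct cpumask *get_idle_cpumask(int cpu)
{
        return idle_masks[cpu_to_node(cpu)]->cpu;
}

static struct cpumask *get_idle_smtmask(int cpu)
{
        return idle_masks[cpu_to_node(cpu)]->smt;
}

Each mask only covers the CPUs of its own node, so CPUs in different nodes
no longer contend on the same cachelines.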
>
> > [Open issue]
> >
> > The scx_bpf_get_idle_cpu/smtmask() kfuncs, which are supposed to return
> > a single cpumask covering all the CPUs, have been changed to report only
> > the cpumask of the NUMA node of the current CPU; this breaks the old
> > behavior, so it can potentially introduce regressions in some scx
> > schedulers.
> >
> > An alternative approach could be to construct a global cpumask
> > on-the-fly, but this could add significant overhead to ops.select_cpu()
> > for schedulers relying on these kfuncs. Additionally, it would be less
> > reliable than accessing the actual cpumasks, as the copy could quickly
> > get out of sync and misrepresent the actual idle state.
> >
> > Probably a better way to solve this issue is to introduce new kfuncs to
> > explicitly select a specific per-NUMA cpumask, and modify the scx
> > schedulers to transition to this new API, for example:
> >
> > const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
> > const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
> >
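
FWIW, on the kernel side these should be pretty much trivial accessors,
roughly like this (rough sketch only, untested; the usual idle-tracking
enablement checks are omitted):

__bpf_kfunc const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
{
        if (node < 0 || node >= nr_node_ids)
                return cpu_none_mask;

        return idle_masks[node]->cpu;
}

with an equivalent scx_bpf_get_idle_numa_smtmask() returning
idle_masks[node]->smt.
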
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
...
> > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> > */
> > if (sched_smt_active()) {
> > const struct cpumask *smt = cpu_smt_mask(cpu);
> > + struct cpumask *idle_smt = get_idle_smtmask(cpu);
> >
> > /*
> > * If offline, @cpu is not its own sibling and
> > * scx_pick_idle_cpu() can get caught in an infinite loop as
> > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > - * is eventually cleared.
> > + * @cpu is never cleared from the idle SMT mask. Ensure that
> > + * @cpu is eventually cleared.
> > */
> > - if (cpumask_intersects(smt, idle_masks.smt))
> > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > - __cpumask_clear_cpu(cpu, idle_masks.smt);
> > + if (cpumask_intersects(smt, idle_smt))
> > + cpumask_andnot(idle_smt, idle_smt, smt);
>
> cpumask_andnot() is a no-op if masks don't intersect.
> cpumask_intersects() is O(N), and you can safely drop it and
> save some cycles.
>
> > + else if (cpumask_test_cpu(cpu, idle_smt))
> > + __cpumask_clear_cpu(cpu, idle_smt);
>
> Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op.
> I feel like you can just clear all those CPUs unconditionally. In
> the worst case, you'll clear them twice, which is harmless. Or am I
> misunderstanding something?
Yeah, makes sense to me; it definitely seems more efficient to just
clear the CPUs unconditionally. Thanks for noticing it.
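
So, something like this (untested), relying on the fact that clearing
already-cleared bits is harmless:

        if (sched_smt_active()) {
                const struct cpumask *smt = cpu_smt_mask(cpu);
                struct cpumask *idle_smt = get_idle_smtmask(cpu);

                /*
                 * Unconditionally clear the whole sibling group and @cpu
                 * itself: both operations are no-ops when the bits are
                 * already clear, and clearing @cpu explicitly still covers
                 * the offline case where @cpu is not in its own SMT mask.
                 */
                cpumask_andnot(idle_smt, idle_smt, smt);
                __cpumask_clear_cpu(cpu, idle_smt);
        }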
>
> > }
> > #endif
> > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
> > }
> >
> > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > {
> > - int cpu;
> > + int start = cpu_to_node(smp_processor_id());
> > + int node, cnt, cpu;
> >
> > retry:
> > if (sched_smt_active()) {
> > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > - if (cpu < nr_cpu_ids)
> > - goto found;
> > + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > + cpu = cpumask_any_and_distribute(idle_masks[node]->smt,
> > + cpus_allowed);
>
> Nit: no need to break the line here.
Ok.
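Will keep it on a single line:

        cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);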
>
> > + if (cpu < nr_cpu_ids)
> > + goto found;
> > + }
> >
> > if (flags & SCX_PICK_IDLE_CORE)
> > return -EBUSY;
> > }
> >
> > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > - if (cpu >= nr_cpu_ids)
> > - return -EBUSY;
> > + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> > + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu,
> > + cpus_allowed);
> > + if (cpu < nr_cpu_ids)
> > + goto found;
> > + }
> > + return -EBUSY;
> >
> > found:
> > if (test_and_clear_cpu_idle(cpu))
> > @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > * piled up on it even if there is an idle core elsewhere on
> > * the system.
> > */
> > - if (!cpumask_empty(idle_masks.cpu) &&
> > + if (!cpumask_empty(get_idle_cpumask(cpu)) &&
> > !(current->flags & PF_EXITING) &&
> > cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> > if (cpumask_test_cpu(cpu, p->cpus_ptr))
>
> cpumask_empty() is O(N), and the other checks are O(1). If you reorder
> them so that cpumask_empty() is the last check, you increase the
> probability of not having to evaluate it at all.
Ok.
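Will reorder the checks so that the O(1) ones run first and the
cpumask_empty() test is evaluated last:

        if (!(current->flags & PF_EXITING) &&
            cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
            !cpumask_empty(get_idle_cpumask(cpu))) {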
>
> > @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> > /*
> > * Keep using @prev_cpu if it's part of a fully idle core.
> > */
> > - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> > + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
> > test_and_clear_cpu_idle(prev_cpu)) {
> > cpu = prev_cpu;
> > goto cpu_found;
> > @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > +
> > + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> > + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask);
>
> cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu)
>
Ok.
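Will do, so the loop becomes:

        for_each_node_state(node, N_POSSIBLE) {
                const struct cpumask *node_mask = cpumask_of_node(node);

                cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
                cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu);
        }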
> > + }
> > }
> >
> > void __scx_update_idle(struct rq *rq, bool idle)
> > @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle)
> > }
> >
> > if (idle)
> > - cpumask_set_cpu(cpu, idle_masks.cpu);
> > + cpumask_set_cpu(cpu, get_idle_cpumask(cpu));
> > else
> > - cpumask_clear_cpu(cpu, idle_masks.cpu);
> > + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu));
>
> assign_cpu(cpu, get_idle_cpumask(cpu), idle)
Ok.
>
> Thanks,
> Yury
Thanks for the review! Will include all these changes in v2.
-Andrea