Message-ID: <Z9w1X_GDIYV1CmIs@gpd3>
Date: Thu, 20 Mar 2025 16:33:51 +0100
From: Andrea Righi <arighi@...dia.com>
To: Joel Fernandes <joelagnelf@...dia.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET v5 sched_ext/for-6.15] sched_ext: Enhance built-in
idle selection with allowed CPUs
On Thu, Mar 20, 2025 at 03:05:37PM +0100, Joel Fernandes wrote:
> On 3/20/2025 8:36 AM, Andrea Righi wrote:
...
> > Example usage
> > =============
> >
> > s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
> >                    s32 prev_cpu, u64 wake_flags)
> > {
> >         const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
> >         s32 cpu;
>
> Andrea, I'm curious why cannot this expression simply be moved into the default
> select implementation? And then for those that need a more custom mask, we can
> do the scx_bpf_select_cpu_and() as a second step.
Yeah, maybe the example could be improved a bit. Basically I'm doing
task_allowed_cpus(p) ?: p->cpus_ptr to highlight that you can't pass NULL
as the extra "and" cpumask (otherwise the verifier won't be happy).
Also, if you call the old scx_bpf_select_cpu_dfl(), the internal logic
already uses the same backend as scx_bpf_select_cpu_and(), passing
p->cpus_ptr as @cpus_allowed.
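IOW, roughly something like this (just a sketch to show the equivalence,
not the literal in-kernel code):

	bool is_idle;
	s32 cpu;

	/* Calling the old interface... */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	/* ...ends up in the same backend as: */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0);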
>
> Also I think I am missing, what is the motivation in the existing code to not do
> LLC/NUMA-only scans if the task is restrained? Thanks for clarifying.
You can use the "flags" argument to restrict the selection to the current
node, setting SCX_PICK_IDLE_IN_NODE.
We don't currently have a SCX_PICK_IDLE_IN_LLC flag (it'd be nice to
introduce it), so for now the only way to restrict the selection to the
current LLC is to use the additional "and" cpumask (@cpus_allowed), passing
the LLC span.
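Something like this (just a sketch; llc_mask is a cpumask that the
scheduler would need to build itself from the LLC span, since there's no
flag or helper for it yet):

	/* Restrict the idle CPU search to the current NUMA node */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr,
				     SCX_PICK_IDLE_IN_NODE);

	/*
	 * Restrict to the current LLC: intersect via the "and" cpumask
	 * (llc_mask must be built by the scheduler itself)
	 */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, llc_mask, 0);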
Thanks,
-Andrea
>
> thanks,
>
> - Joel
>
>
>
> >
> >         cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
> >         if (cpu >= 0) {
> >                 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> >                 return cpu;
> >         }
> >
> >         return prev_cpu;
> > }
> >
> > Results
> > =======
> >
> > Load distribution on a 4 sockets / 4 cores per socket system, simulated
> > using virtme-ng, running a modified version of scx_bpfland that uses the
> > new helper scx_bpf_select_cpu_and() and 0xff00 as allowed domain:
> >
> > $ vng --cpu 16,sockets=4,cores=4,threads=1
> > ...
> > $ stress-ng -c 16
> > ...
> > $ htop
> > ...
> > 0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
> > 1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
> > 2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
> > 3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
> > 4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
> > 5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
> > 6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
> > 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
> >
> > With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all
> > the available CPUs.
> >
> > ChangeLog v4 -> v5:
> > - simplify the code to compute (and) task's temporary cpumasks
> >
> > ChangeLog v3 -> v4:
> > - keep p->nr_cpus_allowed optimizations (skip cpumask operations when the
> > task can run on all CPUs)
> > - allow calling scx_bpf_select_cpu_and() also from ops.enqueue() and
> > modify the kselftest to cover this case as well
> > - rebase to the latest sched_ext/for-6.15
> >
> > ChangeLog v2 -> v3:
> > - incrementally refactor scx_select_cpu_dfl() to accept idle flags and an
> > arbitrary allowed cpumask
> > - build scx_bpf_select_cpu_and() on top of the existing logic
> > - re-arrange scx_select_cpu_dfl() prototype, aligning the first three
> > arguments with select_task_rq()
> > - do not use "domain" for the allowed cpumask to avoid potential ambiguity
> > with sched_domain
> >
> > ChangeLog v1 -> v2:
> > - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and always
> > select idle CPUs strictly within the allowed domain
> > - rename preferred CPUs -> allowed CPUs
> > - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
> > - deprecate scx_bpf_select_cpu_dfl() in favor of scx_bpf_select_cpu_and()
> > and provide all the required backward compatibility boilerplate
> >
> > Andrea Righi (6):
> > sched_ext: idle: Extend topology optimizations to all tasks
> > sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
> > sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
> > sched_ext: idle: Introduce scx_bpf_select_cpu_and()
> > selftests/sched_ext: Add test for scx_bpf_select_cpu_and()
> > sched_ext: idle: Deprecate scx_bpf_select_cpu_dfl()
> >
> > Documentation/scheduler/sched-ext.rst | 11 +-
> > kernel/sched/ext.c | 6 +-
> > kernel/sched/ext_idle.c | 196 ++++++++++++++++-----
> > kernel/sched/ext_idle.h | 3 +-
> > tools/sched_ext/include/scx/common.bpf.h | 5 +-
> > tools/sched_ext/include/scx/compat.bpf.h | 37 ++++
> > tools/sched_ext/scx_flatcg.bpf.c | 12 +-
> > tools/sched_ext/scx_simple.bpf.c | 9 +-
> > tools/testing/selftests/sched_ext/Makefile | 1 +
> > .../testing/selftests/sched_ext/allowed_cpus.bpf.c | 121 +++++++++++++
> > tools/testing/selftests/sched_ext/allowed_cpus.c | 57 ++++++
> > .../selftests/sched_ext/enq_select_cpu_fails.bpf.c | 12 +-
> > .../selftests/sched_ext/enq_select_cpu_fails.c | 2 +-
> > tools/testing/selftests/sched_ext/exit.bpf.c | 6 +-
> > .../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 13 +-
> > .../sched_ext/select_cpu_dfl_nodispatch.c | 2 +-
> > 16 files changed, 404 insertions(+), 89 deletions(-)
> > create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
> > create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c
>