Message-ID: <36366fd3e9bc730300cc2262687c890f@igalia.com>
Date: Sat, 22 Mar 2025 12:56:11 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Andrea Righi <arighi@...dia.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>, Joel
Fernandes <joelagnelf@...dia.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET v6 sched_ext/for-6.15] sched_ext: Enhance built-in idle
selection with allowed CPUs
Hi Andrea,
Looks great to me.
Thanks!
Changwoo Min
On 2025-03-22 07:10, Andrea Righi wrote:
> Many scx schedulers implement their own hard or soft-affinity rules to
> support topology characteristics, such as heterogeneous architectures
> (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
> specific properties (e.g., running certain tasks only in a subset of CPUs).
>
> Currently, there is no mechanism that allows applying the built-in idle
> CPU selection policy to an arbitrary subset of CPUs. As a result,
> schedulers often implement their own idle CPU selection policies, which
> are typically similar to one another, leading to a lot of code
> duplication.
>
> To address this, extend the built-in idle CPU selection policy by
> introducing the concept of allowed CPUs.
>
> With this concept, BPF schedulers can apply the built-in idle CPU selection
> policy to a subset of allowed CPUs, allowing them to implement their own
> hard/soft-affinity rules while still using the topology optimizations of
> the built-in policy, preventing code duplication across different
> schedulers.
>
> To implement this, introduce a new helper kfunc,
> scx_bpf_select_cpu_and(), that accepts a cpumask of allowed CPUs:
>
>    s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu,
>                               u64 wake_flags,
>                               const struct cpumask *cpus_allowed,
>                               u64 flags);
>
> Example usage
> =============
>
>    s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
>                       s32 prev_cpu, u64 wake_flags)
>    {
>            const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
>            s32 cpu;
>
>            cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
>            if (cpu >= 0) {
>                    scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
>                    return cpu;
>            }
>
>            return prev_cpu;
>    }
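
As the v4 changelog below notes, scx_bpf_select_cpu_and() may also be called from ops.enqueue(). A hypothetical sketch of that path (not taken from the patchset; task_allowed_cpus() is the same made-up scheduler-local helper as in the example above):

```c
/* Hypothetical sketch: using scx_bpf_select_cpu_and() at enqueue time,
 * e.g. for tasks that did not go through ops.select_cpu().
 * task_allowed_cpus() is a made-up scheduler-local helper.
 */
void BPF_STRUCT_OPS(foo_enqueue, struct task_struct *p, u64 enq_flags)
{
        const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
        s32 cpu;

        /* Look for an idle CPU within the allowed subset */
        cpu = scx_bpf_select_cpu_and(p, scx_bpf_task_cpu(p), 0, cpus, 0);
        if (cpu >= 0) {
                /* Dispatch directly to the idle CPU's local DSQ */
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
                                   SCX_SLICE_DFL, enq_flags);
                return;
        }

        /* No idle CPU found: fall back to the global DSQ */
        scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}
```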
>
> Results
> =======
>
> Load distribution on a 4 sockets / 4 cores per socket system, simulated
> using virtme-ng, running a modified version of scx_bpfland that uses the
> new helper scx_bpf_select_cpu_and() with 0xff00 as the allowed domain:
>
> $ vng --cpu 16,sockets=4,cores=4,threads=1
> ...
> $ stress-ng -c 16
> ...
> $ htop
> ...
> 0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
> 1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
> 2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
> 3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
> 4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
> 5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
> 6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
> 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
>
> With scx_bpf_select_cpu_dfl(), tasks would instead be distributed evenly
> across all the available CPUs.
>
> ChangeLog v5 -> v6:
> - prevent redundant cpumask_subset() + cpumask_equal() checks in all
> patches
> - remove cpumask_subset() + cpumask_and() combo with local cpumasks, as
> cpumask_and() alone is generally more efficient
> - cleanup patches to prevent unnecessary function renames
>
> ChangeLog v4 -> v5:
> - simplify code to compute the temporary task's cpumasks (and)
>
> ChangeLog v3 -> v4:
> - keep p->nr_cpus_allowed optimizations (skip cpumask operations when the
> task can run on all CPUs)
> - allow calling scx_bpf_select_cpu_and() also from ops.enqueue() and
> modify the kselftest to cover this case as well
> - rebase to the latest sched_ext/for-6.15
>
> ChangeLog v2 -> v3:
> - incrementally refactor scx_select_cpu_dfl() to accept idle flags and an
> arbitrary allowed cpumask
> - build scx_bpf_select_cpu_and() on top of the existing logic
> - re-arrange scx_select_cpu_dfl() prototype, aligning the first three
> arguments with select_task_rq()
> - do not use "domain" for the allowed cpumask to avoid potential ambiguity
> with sched_domain
>
> ChangeLog v1 -> v2:
> - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and always
> select idle CPUs strictly within the allowed domain
> - rename preferred CPUs -> allowed CPUs
> - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
> - deprecate scx_bpf_select_cpu_dfl() in favor of scx_bpf_select_cpu_and()
> and provide all the required backward compatibility boilerplate
>
> Andrea Righi (6):
> sched_ext: idle: Extend topology optimizations to all tasks
> sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
> sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
> sched_ext: idle: Introduce scx_bpf_select_cpu_and()
> selftests/sched_ext: Add test for scx_bpf_select_cpu_and()
> sched_ext: idle: Deprecate scx_bpf_select_cpu_dfl()
>
> Documentation/scheduler/sched-ext.rst | 11 +-
> kernel/sched/ext.c | 6 +-
> kernel/sched/ext_idle.c | 196 ++++++++++++++++-----
> kernel/sched/ext_idle.h | 3 +-
> tools/sched_ext/include/scx/common.bpf.h | 5 +-
> tools/sched_ext/include/scx/compat.bpf.h | 37 ++++
> tools/sched_ext/scx_flatcg.bpf.c | 12 +-
> tools/sched_ext/scx_simple.bpf.c | 9 +-
> tools/testing/selftests/sched_ext/Makefile | 1 +
> .../testing/selftests/sched_ext/allowed_cpus.bpf.c | 121 +++++++++++++
> tools/testing/selftests/sched_ext/allowed_cpus.c | 57 ++++++
> .../selftests/sched_ext/enq_select_cpu_fails.bpf.c | 12 +-
> .../selftests/sched_ext/enq_select_cpu_fails.c | 2 +-
> tools/testing/selftests/sched_ext/exit.bpf.c | 6 +-
> .../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 13 +-
> .../sched_ext/select_cpu_dfl_nodispatch.c | 2 +-
> 16 files changed, 404 insertions(+), 89 deletions(-)
> create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
> create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c