linux-kernel - Re: [PATCHSET v5 sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with allowed CPUs

Message-ID: <778d935f-ef77-4bac-aeff-1bafa91b825e@nvidia.com>
Date: Thu, 20 Mar 2025 15:05:37 +0100
From: Joel Fernandes <joelagnelf@...dia.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>,
 David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET v5 sched_ext/for-6.15] sched_ext: Enhance built-in idle
 selection with allowed CPUs



On 3/20/2025 8:36 AM, Andrea Righi wrote:
> Many scx schedulers implement their own hard or soft-affinity rules to
> support topology characteristics, such as heterogeneous architectures
> (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
> specific properties (e.g., running certain tasks only in a subset of CPUs).
> 
> Currently, there is no mechanism for applying the built-in idle CPU
> selection policy to an arbitrary subset of CPUs. As a result, schedulers
> often implement their own idle CPU selection policies, which are typically
> similar to one another, leading to a lot of code duplication.
> 
> To address this, extend the built-in idle CPU selection policy by
> introducing the concept of allowed CPUs.
> 
> With this concept, BPF schedulers can apply the built-in idle CPU selection
> policy to a subset of allowed CPUs, allowing them to implement their own
> hard/soft-affinity rules while still using the topology optimizations of
> the built-in policy, preventing code duplication across different
> schedulers.
> 
> To implement this, introduce a new helper kfunc scx_bpf_select_cpu_and()
> that accepts a cpumask of allowed CPUs:
> 
> s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu,
> 			   u64 wake_flags,
> 			   const struct cpumask *cpus_allowed, u64 flags);
> 
> Example usage
> =============
> 
> s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
> 		   s32 prev_cpu, u64 wake_flags)
> {
> 	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
> 	s32 cpu;

Andrea, I'm curious: why can't this expression simply be moved into the default
select implementation? Then, for those who need a more custom mask, we can
use scx_bpf_select_cpu_and() as a second step.

Also, I think I am missing something: what is the motivation in the existing
code for not doing LLC/NUMA-only scans if the task's affinity is restricted?
Thanks for clarifying.

thanks,

 - Joel



> 
> 	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
> 	if (cpu >= 0) {
> 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> 		return cpu;
> 	}
> 
> 	return prev_cpu;
> }
> 
> Results
> =======
> 
> Load distribution on a 4 sockets / 4 cores per socket system, simulated
> using virtme-ng, running a modified version of scx_bpfland that uses the
> new helper scx_bpf_select_cpu_and() and 0xff00 as allowed domain:
> 
>      $ vng --cpu 16,sockets=4,cores=4,threads=1
>      ...
>      $ stress-ng -c 16
>      ...
>      $ htop
>      ...
>        0[                         0.0%]   8[||||||||||||||||||||||||100.0%]
>        1[                         0.0%]   9[||||||||||||||||||||||||100.0%]
>        2[                         0.0%]  10[||||||||||||||||||||||||100.0%]
>        3[                         0.0%]  11[||||||||||||||||||||||||100.0%]
>        4[                         0.0%]  12[||||||||||||||||||||||||100.0%]
>        5[                         0.0%]  13[||||||||||||||||||||||||100.0%]
>        6[                         0.0%]  14[||||||||||||||||||||||||100.0%]
>        7[                         0.0%]  15[||||||||||||||||||||||||100.0%]
> 
> With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all
> the available CPUs.
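
The effect of the 0xff00 mask above can be illustrated with a minimal
user-space sketch. This is not the code from the patchset: the type
cpumask_model_t, the function select_cpu_and_model(), the "lowest-numbered
idle CPU" fallback, and the -EBUSY return value are all assumptions made for
illustration, and the real built-in policy also applies topology (SMT, LLC,
NUMA) preferences that are omitted here. The sketch only shows the idea of
intersecting the idle CPUs with an allowed mask before picking one.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Pick an idle CPU strictly within @allowed.
 *
 * Prefer @prev_cpu when it is both idle and allowed; otherwise fall back to
 * the lowest-numbered CPU that is idle and allowed. Returns a negative value
 * (-EBUSY in this model) when no such CPU exists, so callers can test
 * "cpu >= 0" as in the example usage above.
 */
static int select_cpu_and_model(int prev_cpu, cpumask_model_t idle,
				cpumask_model_t allowed)
{
	/* Only CPUs that are idle AND allowed are candidates. */
	cpumask_model_t candidates = idle & allowed;

	assert(prev_cpu >= 0 && prev_cpu < 64);

	if (candidates & (1ULL << prev_cpu))
		return prev_cpu;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (candidates & (1ULL << cpu))
			return cpu;
	}

	return -EBUSY;
}
```

In this model, with all 16 CPUs idle (0xffff) and 0xff00 as the allowed mask,
a task whose previous CPU was 2 is steered to CPU 8, while a task whose
previous CPU was 9 stays on CPU 9; CPUs 0-7 are never selected.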
> 
> ChangeLog v4 -> v5:
>  - simplify the code to compute (and) task's temporary cpumasks
> 
> ChangeLog v3 -> v4:
>  - keep p->nr_cpus_allowed optimizations (skip cpumask operations when the
>    task can run on all CPUs)
>  - allow scx_bpf_select_cpu_and() to be called from ops.enqueue() as well,
>    and modify the kselftest to cover this case
>  - rebase to the latest sched_ext/for-6.15
> 
> ChangeLog v2 -> v3:
>  - incrementally refactor scx_select_cpu_dfl() to accept idle flags and an
>    arbitrary allowed cpumask
>  - build scx_bpf_select_cpu_and() on top of the existing logic
>  - re-arrange scx_select_cpu_dfl() prototype, aligning the first three
>    arguments with select_task_rq()
>  - do not use "domain" for the allowed cpumask to avoid potential ambiguity
>    with sched_domain
> 
> ChangeLog v1 -> v2:
>   - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and always
>     select idle CPUs strictly within the allowed domain
>   - rename preferred CPUs -> allowed CPUs
>   - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
>   - deprecate scx_bpf_select_cpu_dfl() in favor of scx_bpf_select_cpu_and()
>     and provide all the required backward compatibility boilerplate
> 
> Andrea Righi (6):
>       sched_ext: idle: Extend topology optimizations to all tasks
>       sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
>       sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
>       sched_ext: idle: Introduce scx_bpf_select_cpu_and()
>       selftests/sched_ext: Add test for scx_bpf_select_cpu_and()
>       sched_ext: idle: Deprecate scx_bpf_select_cpu_dfl()
> 
>  Documentation/scheduler/sched-ext.rst              |  11 +-
>  kernel/sched/ext.c                                 |   6 +-
>  kernel/sched/ext_idle.c                            | 196 ++++++++++++++++-----
>  kernel/sched/ext_idle.h                            |   3 +-
>  tools/sched_ext/include/scx/common.bpf.h           |   5 +-
>  tools/sched_ext/include/scx/compat.bpf.h           |  37 ++++
>  tools/sched_ext/scx_flatcg.bpf.c                   |  12 +-
>  tools/sched_ext/scx_simple.bpf.c                   |   9 +-
>  tools/testing/selftests/sched_ext/Makefile         |   1 +
>  .../testing/selftests/sched_ext/allowed_cpus.bpf.c | 121 +++++++++++++
>  tools/testing/selftests/sched_ext/allowed_cpus.c   |  57 ++++++
>  .../selftests/sched_ext/enq_select_cpu_fails.bpf.c |  12 +-
>  .../selftests/sched_ext/enq_select_cpu_fails.c     |   2 +-
>  tools/testing/selftests/sched_ext/exit.bpf.c       |   6 +-
>  .../sched_ext/select_cpu_dfl_nodispatch.bpf.c      |  13 +-
>  .../sched_ext/select_cpu_dfl_nodispatch.c          |   2 +-
>  16 files changed, 404 insertions(+), 89 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c