Message-ID: <b350d05c-3279-4d62-86fe-555ef0985f03@igalia.com>
Date: Sat, 15 Mar 2025 10:35:03 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>,
 David Vernet <void@...ifault.com>
Cc: bpf@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/8] sched_ext: idle: Introduce scx_bpf_select_cpu_and()

Hi Andrea,

Thank you for the great work! I like it.

On 3/14/25 18:45, Andrea Righi wrote:
> Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
> the built-in idle CPU selection policy to a subset of allowed CPUs.
> 
> This new helper is basically an extension of scx_bpf_select_cpu_dfl().
> However, when an idle CPU can't be found, it returns a negative value
> instead of @prev_cpu, aligning its behavior more closely with
> scx_bpf_pick_idle_cpu().
> 
> It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
> strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
> request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
> the built-in selection logic.
> 
> With this helper, BPF schedulers can apply the built-in idle CPU
> selection policy restricted to any arbitrary subset of CPUs.
> 
> Example usage
> =============
> 
> Possible usage in ops.select_cpu():
> 
> s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
> 		   s32 prev_cpu, u64 wake_flags)
> {
> 	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
> 	s32 cpu;
> 
> 	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
> 	if (cpu >= 0) {
> 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> 		return cpu;
> 	}
> 
> 	return prev_cpu;
> }
> 
> Results
> =======
> 
> Load distribution on a 4 sockets, 4 cores per socket system, simulated
> using virtme-ng, running a modified version of scx_bpfland that uses
> scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:
> 
>   $ vng --cpu 16,sockets=4,cores=4,threads=1
>   ...
>   $ stress-ng -c 16
>   ...
>   $ htop
>   ...
>     0[                         0.0%]   8[||||||||||||||||||||||||100.0%]
>     1[                         0.0%]   9[||||||||||||||||||||||||100.0%]
>     2[                         0.0%]  10[||||||||||||||||||||||||100.0%]
>     3[                         0.0%]  11[||||||||||||||||||||||||100.0%]
>     4[                         0.0%]  12[||||||||||||||||||||||||100.0%]
>     5[                         0.0%]  13[||||||||||||||||||||||||100.0%]
>     6[                         0.0%]  14[||||||||||||||||||||||||100.0%]
>     7[                         0.0%]  15[||||||||||||||||||||||||100.0%]
> 
> With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
> all the available CPUs.
> 
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---
>   kernel/sched/ext.c                       |  1 +
>   kernel/sched/ext_idle.c                  | 41 ++++++++++++++++++++++++
>   tools/sched_ext/include/scx/common.bpf.h |  2 ++
>   3 files changed, 44 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f42352e8d889e..343f066c1185d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -465,6 +465,7 @@ struct sched_ext_ops {
>   	 * idle CPU tracking and the following helpers become unavailable:
>   	 *
>   	 * - scx_bpf_select_cpu_dfl()
> +	 * - scx_bpf_select_cpu_and()
>   	 * - scx_bpf_test_and_clear_cpu_idle()
>   	 * - scx_bpf_pick_idle_cpu()
>   	 *
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 549551bc97a7b..c0de7b64771d4 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -914,6 +914,46 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>   	return prev_cpu;
>   }
>   
> +/**
> + * scx_bpf_select_cpu_and - Pick an idle CPU usable by task @p,
> + *			    prioritizing those in @cpus_allowed
> + * @p: task_struct to select a CPU for
> + * @prev_cpu: CPU @p was on previously
> + * @wake_flags: %SCX_WAKE_* flags
> + * @cpus_allowed: cpumask of allowed CPUs
> + * @flags: %SCX_PICK_IDLE* flags
> + *
> + * Can only be called from ops.select_cpu() if the built-in CPU selection is
> + * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
> + * @p, @prev_cpu and @wake_flags match ops.select_cpu().

I think scx_bpf_select_cpu_and() should also be allowed to be called
from ops.enqueue(). Many scx schedulers implement logic similar to
scx_bpf_select_cpu_dfl() there in order to proactively kick an idle
CPU.
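
As a rough illustration, an enqueue path could look like the
following (a hypothetical sketch, assuming the kfunc were callable
from ops.enqueue(); foo_enqueue and SHARED_DSQ are made-up names, and
SHARED_DSQ is assumed to have been created with scx_bpf_create_dsq()):

void BPF_STRUCT_OPS(foo_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 cpu;

	/* Queue the task on a shared DSQ first. */
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/*
	 * Look for an idle CPU in the task's allowed set and, if one
	 * is found, kick it so it can pull work from the shared DSQ.
	 */
	cpu = scx_bpf_select_cpu_and(p, scx_bpf_task_cpu(p), 0,
				     p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}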

> + *
> + * Returns the selected idle CPU, which will be automatically awakened upon
> + * returning from ops.select_cpu() and can be used for direct dispatch, or
> + * a negative value if no idle CPU is available.
> + */
> +__bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
> +				       const struct cpumask *cpus_allowed, u64 flags)
> +{
> +	s32 cpu;
> +
> +	if (!ops_cpu_valid(prev_cpu, NULL))
> +		return -EINVAL;
> +
> +	if (!check_builtin_idle_enabled())
> +		return -EBUSY;
> +
> +	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> +		return -EPERM;
> +
> +#ifdef CONFIG_SMP
> +	cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
> +#else
> +	cpu = -EBUSY;
> +#endif
> +
> +	return cpu;
> +}
> +
>   /**
>    * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the
>    * idle-tracking per-CPU cpumask of a target NUMA node.
> @@ -1222,6 +1262,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
>   
>   BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
>   BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
>   BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
>   
>   static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index dc4333d23189f..6f1da61cf7f17 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -48,6 +48,8 @@ static inline void ___vmlinux_h_sanity_check___(void)
>   
>   s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
>   s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
> +s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
> +			   const struct cpumask *cpus_allowed, u64 flags) __ksym __weak;
>   void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym __weak;
>   void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym __weak;
>   u32 scx_bpf_dispatch_nr_slots(void) __ksym;

Regards,
Changwoo Min
