Message-ID: <aCI9_GBpky0cowH9@gpd3>
Date: Mon, 12 May 2025 20:29:16 +0200
From: Andrea Righi <arighi@...dia.com>
To: David Vernet <void@...ifault.com>
Cc: Tejun Heo <tj@...nel.org>, Changwoo Min <changwoo@...lia.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable
from any context
Hi David,
On Mon, May 12, 2025 at 11:58:29AM -0500, David Vernet wrote:
> On Mon, May 12, 2025 at 05:14:56PM +0200, Andrea Righi wrote:
> > Allow scx_bpf_select_cpu_and() to be used from any context, not just
> > from ops.enqueue() or ops.select_cpu().
> >
> > This enables schedulers, including user-space ones, to implement a
> > consistent idle CPU selection policy and helps reduce code duplication.
> >
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
>
> Hi Andrea,
>
> > ---
> > kernel/sched/ext_idle.c | 23 +++++++++++++++++------
> > 1 file changed, 17 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 6915685cd3d6b..5373132db02b6 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -943,10 +943,15 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
>
> FYI it looks like there are a few comments sprinkled around here that specify
> that only the .select_cpu() and .enqueue() ops are allowed.
Ah, good catch! I'll fix the comments.
>
> > if (!check_builtin_idle_enabled())
> > return -EBUSY;
> >
> > - if (!scx_kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE))
> > - return -EPERM;
> > -
> > #ifdef CONFIG_SMP
> > + /*
> > + * If called from an unlocked context, try to acquire
> > + * cpus_read_lock() to avoid races with CPU hotplug.
> > + */
> > + if (scx_kf_allowed_if_unlocked())
> > + if (!cpus_read_trylock())
> > + return -EBUSY;
>
> Hmm, so this now assumes that this function can be called without the rq lock
> held. I'm not sure that's safe, as scx_select_cpu_dfl() queries p->cpus_ptr,
> which is protected by the rq lock. Additionally, scx_bpf_select_cpu_and()
> checks p->nr_cpus_allowed, which is protected sometimes by pi_lock, sometimes
> by the rq lock, etc., depending on the task's state.
Yeah, it makes locking quite tricky. Maybe we can acquire the rq lock when
called from an unlocked context, instead of cpus_read_lock(), but then we
still have to deal with pi_lock.
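Something along these lines, completely untested, just to sketch the idea
(task_rq_lock() should cover both p->pi_lock and the rq lock):

	struct rq_flags rf;
	struct rq *rq = NULL;

	/* Only take the locks if the caller doesn't hold them already */
	if (scx_kf_allowed_if_unlocked())
		rq = task_rq_lock(p, &rf);

	cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);

	if (rq)
		task_rq_unlock(rq, p, &rf);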
>
> This might be fine in practice, as I _think_ the only unsafe thing would be if
> p->cpus_ptr could have a torn read or write. scx_select_cpu_dfl() does do
> preempt_disable(), so there are no issues querying the per-CPU variables there.
I've been running `stress-ng --race-sched N` for a while to stress test
affinity changes and I haven't triggered any errors, but that doesn't mean
the code is safe...
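If pointer tearing is the concern, the best we could probably do from an
unlocked context is take a single snapshot of the pointer, e.g.:

	/*
	 * Hypothetical: snapshot the cpumask pointer once. The mask bits
	 * can still change under us, but each word is read atomically,
	 * so at worst we act on a slightly stale mask.
	 */
	const struct cpumask *allowed = READ_ONCE(p->cpus_ptr);

although that would only protect the pointer itself, not the mask
contents.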
>
> As long as this is safe (or can be made safe) I don't have a strong opinion one
> way or the other. I think it's probably a good idea to allow users to remove
> code duplication, and in general it's up to the user to use this API correctly
> (e.g. if you're preempted during a call to scx_bpf_select_cpu_and() and
> prev_cpu changes out from under you due to the task being migrated, that's
> likely just a bug in the scheduler).
Right, I guess the biggest issue here is dealing with locking in a safe
way. As for the scheduler not being 100% accurate with prev_cpu and the
like, I think that's a non-problem: in the end we don't need to be formally
perfect, and we can tolerate small inaccuracies when this API is used from
a non-atomic context.
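For context, the kind of usage I'd expect from a scheduler is roughly the
following (just a sketch, the op name is made up and p->cpus_ptr is only an
example mask); after this change, the same call would also be legal from
other callbacks or contexts:

	s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
			   s32 prev_cpu, u64 wake_flags)
	{
		s32 cpu;

		/* Pick an idle CPU honoring the task's affinity */
		cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags,
					     p->cpus_ptr, 0);
		if (cpu >= 0)
			return cpu;

		/* No idle CPU found, fall back to prev_cpu */
		return prev_cpu;
	}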
I'll think more about it and do some experiments with locking.
Thanks!
-Andrea
>
> Thanks,
> David
>
> > /*
> > * This may also be called from ops.enqueue(), so we need to handle
> > * per-CPU tasks as well. For these tasks, we can skip all idle CPU
> > @@ -956,10 +961,16 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
> > if (p->nr_cpus_allowed == 1) {
> > if (cpumask_test_cpu(prev_cpu, cpus_allowed) &&
> > scx_idle_test_and_clear_cpu(prev_cpu))
> > - return prev_cpu;
> > - return -EBUSY;
> > + cpu = prev_cpu;
> > + else
> > + cpu = -EBUSY;
> > + goto out_unlock;
> > }
> > cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
> > +
> > +out_unlock:
> > + if (scx_kf_allowed_if_unlocked())
> > + cpus_read_unlock();
> > #else
> > cpu = -EBUSY;
> > #endif
> > @@ -1266,6 +1277,7 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> > BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> > BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU)
> > BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> > +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> > BTF_KFUNCS_END(scx_kfunc_ids_idle)
> >
> > static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> > @@ -1275,7 +1287,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> >
> > BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> > BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> > -BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> > BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> >
> > static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> > --
> > 2.49.0
> >