[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPiLHWVf0Vp1qUzV@gpd4>
Date: Wed, 22 Oct 2025 09:43:25 +0200
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
sched-ext@...ts.linux.dev, bpf@...r.kernel.org,
Wen-Fang Liu <liuwenfang@...or.com>
Subject: Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Hi Tejun,
On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:
> SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
> a reschedule and can be used to implement operations like core scheduling.
>
> This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
> which was always called when a CPU picks the next task to run, allowing
> SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
> pick the next task.
>
> However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
> with switch_class()") replaced scx_next_task_picked() with the
> switch_class() callback, which is only called when switching between sched
> classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
> reliably incremented unless the previous task was SCX and the next task was
> not.
>
> This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
> pick_task_scx()") which refactored the pick path making put_prev_task_scx()
> the natural place to track task switches for SCX_KICK_WAIT. The fix moves
> pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
> current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
> switches out. This provides sufficient guarantee for use cases like core
> scheduling while keeping the operation self-contained within SCX.
>
> Reported-by: Wen-Fang Liu <liuwenfang@...or.com>
> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
> Cc: Peter Zijlstra <peterz@...radead.org>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> ---
> kernel/sched/ext.c | 31 ++++++++++++++++++-------------
> kernel/sched/ext_internal.h | 6 ++++--
> 2 files changed, 22 insertions(+), 15 deletions(-)
>
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
> struct scx_sched *sch = scx_root;
> const struct sched_class *next_class = next->sched_class;
>
> - /*
> - * Pairs with the smp_load_acquire() issued by a CPU in
> - * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> - * resched.
> - */
> - smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
> return;
>
> @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
> struct task_struct *next)
> {
> struct scx_sched *sch = scx_root;
> +
> + /*
> + * Pairs with the smp_load_acquire() issued by a CPU in
> + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> + * resched.
> + */
> + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +
> update_curr_scx(rq);
>
> /* see dequeue_task_scx() on why we skip when !QUEUED */
> @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
> }
>
> if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
> - pseqs[cpu] = rq->scx.pnt_seq;
> - should_wait = true;
> + if (cur_class == &ext_sched_class) {
> + pseqs[cpu] = rq->scx.pnt_seq;
> + should_wait = true;
> + } else {
> + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
> + }
> }
>
> resched_curr(rq);
> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>
> if (cpu != cpu_of(this_rq)) {
It's probably fine anyway, but should we check for cpu_online(cpu) here?
> /*
> - * Pairs with smp_store_release() issued by this CPU in
> - * switch_class() on the resched path.
> + * Pairs with store_release in put_prev_task_scx().
> *
> - * We busy-wait here to guarantee that no other task can
> - * be scheduled on our core before the target CPU has
> - * entered the resched path.
> + * We busy-wait here to guarantee that the task running
> + * at the time of kicking is no longer running. This can
> + * be used to implement e.g. core scheduling.
> */
> while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> cpu_relax();
I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
is refilled concurrently between kick_one_cpu() and this busy wait. In this
case we return, because wait_pnt_seq is incremented, but we keep running
the same task.
Should we introduce a flag (or something similar) to force the re-enqueue
of the prev task in this case?
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -997,8 +997,10 @@ enum scx_kick_flags {
> SCX_KICK_PREEMPT = 1LLU << 1,
>
> /*
> - * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
> - * return after the target CPU finishes picking the next task.
> + * The scx_bpf_kick_cpu() call will return after the current SCX task of
> + * the target CPU switches out. This can be used to implement e.g. core
> + * scheduling. This has no effect if the current task on the target CPU
> + * is not on SCX.
> */
> SCX_KICK_WAIT = 1LLU << 2,
> };
Thanks,
-Andrea
Powered by blists - more mailing lists