Message-ID: <aKUWoePcNPcnJT1D@slm.duckdns.org>
Date: Tue, 19 Aug 2025 14:28:17 -1000
From: 'Tejun Heo' <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: liuwenfang <liuwenfang@...or.com>, 'David Vernet' <void@...ifault.com>,
'Andrea Righi' <arighi@...dia.com>,
'Changwoo Min' <changwoo@...lia.com>,
'Ingo Molnar' <mingo@...hat.com>,
'Juri Lelli' <juri.lelli@...hat.com>,
'Vincent Guittot' <vincent.guittot@...aro.org>,
'Dietmar Eggemann' <dietmar.eggemann@....com>,
'Steven Rostedt' <rostedt@...dmis.org>,
'Ben Segall' <bsegall@...gle.com>, 'Mel Gorman' <mgorman@...e.de>,
'Valentin Schneider' <vschneid@...hat.com>,
"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
Joel Fernandes <joelagnelf@...dia.com>
Subject: Re: [PATCH v4 2/3] sched_ext: Fix cpu_released while RT task and SCX
task are scheduled concurrently
Hello, Peter.
(cc'ing Joel for the @rf addition to pick_task())
On Tue, Aug 19, 2025 at 09:47:36AM +0200, Peter Zijlstra wrote:
...
> You're now asking for a 3rd call out to do something like:
>
> ->balance() -- ready a task for pick
> ->pick() -- picks a random other task
> ->put_prev() -- oops, our task didn't get picked, stick it back
>
> Which is bloody ludicrous. So no. We're not doing this.
>
> Why can't pick DTRT ?
This is unfortunate, but, given how things are set up right now, I think we
probably need the last one. Taking a step back and also considering the
proposed @rf addition to pick():
- SCX needs to do most of its dispatch operations in balance() because the
kernel side doesn't know which tasks are going to execute on which CPU
until the task is actually picked for execution, so all picking must be
preceded by balance() where tasks can be moved across rqs.
- There's a gap between balance() and pick_task() where a successful return
from balance() doesn't guarantee that the corresponding pick_task() will be
called. This seems intentional to guarantee that no matter what happens
during balance(), pick_task() of the highest priority class with a pending
task is guaranteed to get the CPU.
This guarantee changes if we add @rf to pick_task() and let it unlock and
relock. A higher priority task may get queued while the rq lock is
released and then the lower priority pick_task() may still return a task
of its own. This should be resolvable although it may not be completely
trivial. We need to move clear_tsk_need_resched() before the pick_task()
calls, and wakeup_preempt() would probably need more complications to
guarantee that resched_curr() is not skipped while scheduling is taking
place.
- SCX's ops.cpu_acquire() and .cpu_release() tell the BPF scheduler whether
a CPU is available for running SCX tasks. We want to tell the BPF side
that a CPU became available before its ops.dispatch() is called - i.e.
before balance(). So, IIUC, this is where the problem is. Because there's
a gap between balance() and pick_task(), the CPU might get taken by a
higher priority sched class in between. If that happens, we need to tell
the BPF scheduler that it lost the CPU. However, if the previous task
wasn't an SCX one, there's currently no place to tell it so.
IOW, SCX needs to invoke ops.cpu_release() when a CPU is taken between
its balance() and pick_task(); however, that can happen when both prev and
next tasks are !SCX tasks, so it needs something which is always called.
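To make the gap in the points above concrete, here is a minimal user-space
sketch. It is a toy model and not kernel code: every toy_* name, the
three-class layout and the have_always_called_hook flag are hypothetical
and exist only for illustration. It assumes that SCX's balance() is reached
regardless of prev's class, and that the only existing place to report a
lost CPU is reached when prev was an SCX task.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, not the kernel's data structures. */
enum toy_class { TOY_RT = 0, TOY_SCX = 1, TOY_IDLE = 2, TOY_NR_CLASSES = 3 };

struct toy_rq {
	int nr_queued[TOY_NR_CLASSES];	/* runnable tasks per class */
	bool scx_cpu_acquired;		/* BPF side believes the CPU is its own */
	int nr_acquire_calls;		/* stand-in for ops.cpu_acquire() calls */
	int nr_release_calls;		/* stand-in for ops.cpu_release() calls */
};

/* Stand-in for SCX balance(): announce the CPU first, then dispatch. */
static bool toy_scx_balance(struct toy_rq *rq, int dispatchable)
{
	if (!rq->scx_cpu_acquired) {
		rq->scx_cpu_acquired = true;
		rq->nr_acquire_calls++;
	}
	rq->nr_queued[TOY_SCX] += dispatchable;
	return rq->nr_queued[TOY_SCX] > 0;
}

/* The highest priority class with a pending task always wins the pick. */
static enum toy_class toy_pick(struct toy_rq *rq)
{
	for (int c = TOY_RT; c < TOY_IDLE; c++) {
		if (rq->nr_queued[c] > 0) {
			rq->nr_queued[c]--;
			return (enum toy_class)c;
		}
	}
	return TOY_IDLE;
}

/*
 * One scheduling pass. rt_wakeups_in_gap models RT tasks that get queued
 * after SCX balance() returned but before the pick happens.
 * have_always_called_hook models "something which is always called".
 */
static enum toy_class toy_schedule(struct toy_rq *rq, enum toy_class prev,
				   int dispatchable, int rt_wakeups_in_gap,
				   bool have_always_called_hook)
{
	assert(dispatchable >= 0 && rt_wakeups_in_gap >= 0);

	(void)toy_scx_balance(rq, dispatchable);	/* success is no promise */
	rq->nr_queued[TOY_RT] += rt_wakeups_in_gap;	/* the gap */
	enum toy_class next = toy_pick(rq);

	/* Assumed: the release notification is only reachable via an SCX prev. */
	bool hook_runs = (prev == TOY_SCX) || have_always_called_hook;

	if (hook_runs && next != TOY_SCX && rq->scx_cpu_acquired) {
		rq->scx_cpu_acquired = false;
		rq->nr_release_calls++;
	}
	return next;
}
```

Calling toy_schedule(&rq, TOY_RT, 1, 1, false) on a zeroed toy_rq picks the
RT task while scx_cpu_acquired stays true and nr_release_calls stays 0,
which is the stale state described above. Passing true for the last
argument lets the release notification run even though neither prev nor
next is an SCX task.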
If @rf is added to pick_task() so that we can merge balance() into
pick_task(), that'd simplify these. SCX wouldn't need balance index
boosting and could handle cpu_acquire/release() within pick_task(). What do
you think?
Thanks.
--
tejun