linux-kernel - RE: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently

Message-ID: <0144ab66963248cf8587c47bf900aabb@honor.com>
Date: Sun, 20 Jul 2025 09:20:22 +0000
From: liuwenfang <liuwenfang@...or.com>
To: 'Tejun Heo' <tj@...nel.org>
CC: 'David Vernet' <void@...ifault.com>, 'Andrea Righi' <arighi@...dia.com>,
	'Changwoo Min' <changwoo@...lia.com>, 'Ingo Molnar' <mingo@...hat.com>,
	'Peter Zijlstra' <peterz@...radead.org>, 'Juri Lelli'
	<juri.lelli@...hat.com>, 'Vincent Guittot' <vincent.guittot@...aro.org>,
	'Dietmar Eggemann' <dietmar.eggemann@....com>, 'Steven Rostedt'
	<rostedt@...dmis.org>, 'Ben Segall' <bsegall@...gle.com>, 'Mel Gorman'
	<mgorman@...e.de>, 'Valentin Schneider' <vschneid@...hat.com>,
	"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX
 task are scheduled concurrently

Thanks for your feedback.

> 
> Hello,
> 
> My apologies for the really late reply. I've been off work and ended up a lot
> more offline than I expected.
> 
> On Sat, Jun 28, 2025 at 06:50:32AM +0000, liuwenfang wrote:
> > Suppose RT task (RT1) is running on CPU0 and RT task (RT2) is woken up
> > on CPU1. RT1 goes to sleep and SCX task (SCX1) is about to be dispatched
> > to CPU0, while RT2 is placed on CPU0:
> >
> > CPU0(schedule)                                     CPU1(try_to_wake_up)
> > set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> > __schedule                                           select_task_rq # CPU0 is selected
> > LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
> >   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
> >     pick_next_task # no more RT tasks on rq                 |
> >       prev_balance                                          |
> >         balance_scx                                         |
> >           balance_one                                       |
> >             rq->scx.cpu_released = false;                   |
> >               consume_global_dsq                            |
> >                 consume_dispatch_q                          |
> >                   consume_remote_task                       |
> >                     UNLOCK rq(0)->lock                      V
> >                                                          LOCK rq(0)->lock # succ
> >                     deactivate_task # SCX1               ttwu_do_activate
> >                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 enqueued
> >                        |                                 UNLOCK rq(0)->lock
> >                        V
> >                     LOCK rq(0)->lock # succ
> >                     activate_task # SCX1
> >       pick_task # RT2 is picked
> >       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> > UNLOCK rq(0)->lock
> >
> > In the end, RT2 runs on CPU0 while rq->scx.cpu_released is still false!
> >
> > So add scx_next_task_picked() and check the sched class again to fix
> > the value of rq->scx.cpu_released.
> 
> Yeah, the problem and diagnosis look correct to me. It'd be nice if we didn't
> have to add an explicit hook, but ops.cpu_acquire() needs to be called before
> dispatching to the CPU, and then we can lose the CPU while doing ops.pick_task().
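The interleaving above can be replayed in a small user-space model. This is only a sketch under stated assumptions: every model_* name is hypothetical, locking is reduced to call order, and the single boolean stands in for rq->scx.cpu_released. It is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these are kernel types. */
enum model_class { MODEL_CLASS_RT, MODEL_CLASS_SCX };

struct model_rq {
	bool cpu_released;	/* stands in for rq->scx.cpu_released */
	bool rt_queued;		/* an RT task is runnable on this rq */
};

/* balance_one(): SCX takes the CPU back, no RT task is visible yet. */
static void model_balance_one(struct model_rq *rq)
{
	rq->cpu_released = false;
}

/* Remote wakeup that slips in while the rq lock is dropped. */
static void model_remote_wakeup_rt(struct model_rq *rq)
{
	rq->rt_queued = true;
}

/* pick_task(): RT has higher priority than SCX. */
static enum model_class model_pick(const struct model_rq *rq)
{
	return rq->rt_queued ? MODEL_CLASS_RT : MODEL_CLASS_SCX;
}

/* Proposed hook: re-check the class of the task that was actually picked. */
static void model_next_task_picked(struct model_rq *rq, enum model_class next)
{
	if (next != MODEL_CLASS_SCX)
		rq->cpu_released = true;
}

/* Replays the interleaving and returns the final cpu_released value. */
static bool model_replay(bool with_hook)
{
	/* RT1 was running, so the CPU starts out released. */
	struct model_rq rq = { .cpu_released = true, .rt_queued = false };
	enum model_class next;

	model_balance_one(&rq);		/* RT1 slept, SCX reclaims CPU0 */
	model_remote_wakeup_rt(&rq);	/* RT2 enqueued during lock drop */
	next = model_pick(&rq);		/* RT2 wins the pick */
	if (with_hook)
		model_next_task_picked(&rq, next);
	return rq.cpu_released;
}
```

In this model, model_replay(false) leaves the flag false although an RT task was picked, which is the reported symptom, while model_replay(true) ends with the flag set.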
> 
> > Signed-off-by: l00013971 <l00013971@...onor.com>
> 
> Can you please use "FIRST_NAME LAST_NAME <EMAIL>" when signing off?
> 
> > -static void switch_class(struct rq *rq, struct task_struct *next)
> > +static void switch_class(struct rq *rq, struct task_struct *next,
> > +			 bool prev_on_scx)
> >  {
> >  	const struct sched_class *next_class = next->sched_class;
> >
> > @@ -3197,7 +3197,8 @@ static void switch_class(struct rq *rq, struct task_struct *next)
> >  	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> >  	 * resched.
> >  	 */
> > -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> > +	if (prev_on_scx)
> > +		smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> 
> The seq handling is obviously broken as it stands: it is only incremented on
> scx -> !scx transitions, but it should be incremented on all transitions. This
> breakage was introduced by b999e365c298 ("sched, sched_ext: Replace
> scx_next_task_picked() with sched_class->switch_class()").
Thanks for the suggestion.
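To make the pnt_seq point concrete, here is a minimal user-space sketch using C11 atomics in place of smp_store_release(). The model_* names are hypothetical, and the sketch assumes a single writer (the CPU that holds the rq lock) with a waiter that only compares snapshots.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rq->scx.pnt_seq field; not a kernel type. */
struct model_rq_seq {
	atomic_ulong pnt_seq;
};

/* Release-store of seq + 1, playing the role of smp_store_release(). */
static void model_bump_seq(struct model_rq_seq *rq)
{
	unsigned long cur = atomic_load_explicit(&rq->pnt_seq,
						 memory_order_relaxed);

	atomic_store_explicit(&rq->pnt_seq, cur + 1, memory_order_release);
}

/*
 * One task switch. With bump_always == false the bump happens only on
 * scx -> !scx, which is the behaviour described as broken above.
 */
static void model_task_switch(struct model_rq_seq *rq, bool bump_always,
			      bool prev_scx, bool next_scx)
{
	if (bump_always || (prev_scx && !next_scx))
		model_bump_seq(rq);
}

/* Returns true if a waiter that snapshotted the seq sees it advance. */
static bool model_waiter_unblocked(bool bump_always, bool prev_scx,
				   bool next_scx)
{
	struct model_rq_seq rq;
	unsigned long snapshot;

	atomic_init(&rq.pnt_seq, 0UL);
	snapshot = atomic_load_explicit(&rq.pnt_seq, memory_order_acquire);
	model_task_switch(&rq, bump_always, prev_scx, next_scx);
	return atomic_load_explicit(&rq.pnt_seq,
				    memory_order_acquire) != snapshot;
}
```

With bump_always == false, an scx -> scx switch never advances the counter, so a waiter comparing against its snapshot would keep waiting; bumping on every switch removes that case.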
> 
> > +void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
> > +			  struct task_struct *next)
> > +{
> > +	bool prev_on_scx = prev && (prev->sched_class == &ext_sched_class);
> 
> I don't think @prev or @next can ever be NULL, can they?
@prev always has a valid value in the core scheduler routines.
> 
> > +
> > +	if (!scx_enabled() ||
> 
> Let's make this an inline function in ext.h. The pnt_seq update should be moved
> here, after the scx_enabled() test, I think. This probably should be a separate
> patch.
Makes sense.  Thanks for the suggestion.
> 
> > +	    !next ||
> > +	    next->sched_class == &ext_sched_class)
> > +		return;
> > +
> > +	switch_class(rq, next, prev_on_scx);
> > +}
> >
> >  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
> >  			      struct task_struct *next)
> >  {
> > @@ -3253,7 +3267,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
> >  		 */
> >  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
> >  			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> > -			goto switch_class;
> > +			return;
> ...
> > @@ -2465,6 +2468,8 @@ static inline void put_prev_set_next_task(struct rq *rq,
> >
> >  	__put_prev_set_next_dl_server(rq, prev, next);
> >
> > +	scx_next_task_picked(rq, prev, next);
> 
> It's a bit unfortunate that we need to add this hook but I can't see another way
> around it for both the problem you're reporting and the pnt_seq issue.
> Maybe name it scx_put_prev_set_next(rq, prev, next) for consistency?
Makes sense.  Thanks for the suggestion.
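Putting the review comments together, the hook might take roughly the following shape. This is a user-space sketch only: the structs, the scx_enabled flag and all model_* names are hypothetical stand-ins, and the real switch_class() does more than set one flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for task_struct and rq; not kernel types. */
struct model_task {
	bool on_scx;		/* task belongs to the SCX class */
};

struct model_rq {
	bool scx_enabled;	/* models the scx_enabled() test */
	unsigned long pnt_seq;	/* models rq->scx.pnt_seq */
	bool cpu_released;	/* models rq->scx.cpu_released */
};

/* Only the flag update is modelled here. */
static void model_switch_class(struct model_rq *rq)
{
	rq->cpu_released = true;
}

/* Modelled after the suggested scx_put_prev_set_next(rq, prev, next). */
static inline void model_scx_put_prev_set_next(struct model_rq *rq,
					       const struct model_task *prev,
					       const struct model_task *next)
{
	(void)prev;		/* kept only to mirror the suggested signature */

	if (!rq->scx_enabled)
		return;

	/* Bump the sequence on every switch, after the enabled test. */
	rq->pnt_seq++;

	/* A non-SCX task took the CPU: mark it released. */
	if (!next->on_scx)
		model_switch_class(rq);
}
```

Keeping the enabled test first means the hook does nothing when no BPF scheduler is loaded, and the sequence bump no longer depends on which class @prev belonged to.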
> 
> Thanks.
> 
> --
> Tejun
-- 
Regards.
wenfang