linux-kernel - Re: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX task are scheduled concurrently


Message-ID: <aHltRzhQjwPsGovj@slm.duckdns.org>
Date: Thu, 17 Jul 2025 11:38:15 -1000
From: 'Tejun Heo' <tj@...nel.org>
To: liuwenfang <liuwenfang@...or.com>
Cc: 'David Vernet' <void@...ifault.com>, 'Andrea Righi' <arighi@...dia.com>,
	'Changwoo Min' <changwoo@...lia.com>,
	'Ingo Molnar' <mingo@...hat.com>,
	'Peter Zijlstra' <peterz@...radead.org>,
	'Juri Lelli' <juri.lelli@...hat.com>,
	'Vincent Guittot' <vincent.guittot@...aro.org>,
	'Dietmar Eggemann' <dietmar.eggemann@....com>,
	'Steven Rostedt' <rostedt@...dmis.org>,
	'Ben Segall' <bsegall@...gle.com>, 'Mel Gorman' <mgorman@...e.de>,
	'Valentin Schneider' <vschneid@...hat.com>,
	"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 1/2] sched_ext: Fix cpu_released while RT task and SCX
 task are scheduled concurrently

Hello,

My apologies for the really late reply. I've been off work and ended up a lot
more offline than I expected.

On Sat, Jun 28, 2025 at 06:50:32AM +0000, liuwenfang wrote:
> Supposed RT task(RT1) is running on CPU0 and RT task(RT2) is awakened on CPU1,
> RT1 becomes sleep and SCX task(SCX1) will be dispatched to CPU0, RT2 will be
> placed on CPU0:
> 
> CPU0(schedule)                                     CPU1(try_to_wake_up)
> set_current_state(TASK_INTERRUPTIBLE)              try_to_wake_up # RT2
> __schedule                                           select_task_rq # CPU0 is selected
> LOCK rq(0)->lock # lock CPU0 rq                        ttwu_queue
>   deactivate_task # RT1                                  LOCK rq(0)->lock # busy waiting
>     pick_next_task # no more RT tasks on rq                 |
>       prev_balance                                          |
>         balance_scx                                         |
>           balance_one                                       |
>             rq->scx.cpu_released = false;                   |
>               consume_global_dsq                            |
>                 consume_dispatch_q                          |
>                   consume_remote_task                       |
>                     UNLOCK rq(0)->lock                      V
>                                                          LOCK rq(0)->lock # succ
>                     deactivate_task # SCX1               ttwu_do_activate
>                     LOCK rq(0)->lock # busy waiting      activate_task # RT2 equeued
>                        |                                 UNLOCK rq(0)->lock
>                        V
>                     LOCK rq(0)->lock # succ
>                     activate_task # SCX1
>       pick_task # RT2 is picked
>       put_prev_set_next_task # prev is RT1, next is RT2, rq->scx.cpu_released = false;
> UNLOCK rq(0)->lock
> 
> At last, RT2 will be running on CPU0 with rq->scx.cpu_released being false!
> 
> So, Add the scx_next_task_picked () and check sched class again to fix the value
> of rq->scx.cpu_released.

Yeah, the problem and diagnosis look correct to me. It'd be nice if we didn't
have to add an explicit hook, but ops.cpu_acquire() needs to be called before
dispatching to the CPU, and then we can lose the CPU while doing
ops.pick_task().
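
To make the interleaving concrete, here is a rough userspace replay of the
trace quoted above. It is only a sketch: every model_* name is made up for
illustration and none of this is kernel code. It assumes the balance step
clears the flag before dispatching, and that the RT wakeup lands while the rq
lock is dropped.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two scheduling classes in the trace. */
enum model_class { MODEL_CLASS_RT, MODEL_CLASS_EXT };

/* Hypothetical model of the per-CPU state that matters here. */
struct model_rq {
	bool cpu_released;	/* stands in for rq->scx.cpu_released */
	bool rt_queued;		/* an RT task is runnable on this rq */
	bool scx_queued;	/* an SCX task is runnable on this rq */
};

/* Balance step: the CPU is taken back for SCX before dispatching to it. */
static void model_balance_scx(struct model_rq *rq)
{
	rq->cpu_released = false;
	rq->scx_queued = true;	/* SCX1 is pulled onto this rq */
}

/* Remote wakeup that slips in while the rq lock is dropped. */
static void model_remote_rt_wakeup(struct model_rq *rq)
{
	rq->rt_queued = true;	/* RT2 is enqueued */
}

/* RT outranks SCX; falls back to SCX when no RT task is queued. */
static enum model_class model_pick_task(const struct model_rq *rq)
{
	return rq->rt_queued ? MODEL_CLASS_RT : MODEL_CLASS_EXT;
}

/* Replays the trace; true if an RT task runs with cpu_released still false. */
static bool model_replay_is_inconsistent(void)
{
	struct model_rq rq = { .cpu_released = true };	/* RT1 was running */

	model_balance_scx(&rq);
	model_remote_rt_wakeup(&rq);
	return model_pick_task(&rq) == MODEL_CLASS_RT && !rq.cpu_released;
}
```

Without a hook after the pick, nothing in this model ever sets the flag back,
which matches the state reported at the end of the trace.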

> Signed-off-by: l00013971 <l00013971@...onor.com>

Can you please use "FIRST_NAME LAST_NAME <EMAIL>" when signing off?

> -static void switch_class(struct rq *rq, struct task_struct *next)
> +static void switch_class(struct rq *rq, struct task_struct *next, bool prev_on_scx)
>  {
>  	const struct sched_class *next_class = next->sched_class;
>  
> @@ -3197,7 +3197,8 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
>  	 * resched.
>  	 */
> -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +	if (prev_on_scx)
> +		smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);

It's obviously broken as the seq is currently only incremented on
scx -> !scx transitions, but it should be incremented on all transitions.
This breakage was introduced by b999e365c298 ("sched, sched_ext: Replace
scx_next_task_picked() with sched_class->switch_class()").
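
As a rough illustration of why the bump has to be unconditional, here is a
userspace sketch using C11 atomics in place of smp_store_release(). The
seq_model_* names are hypothetical, and the waiter is an assumption modelled
on the comment about kick_cpus_irq_workfn() waiting for the CPU to resched.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for rq->scx.pnt_seq. */
struct seq_model_rq {
	atomic_ulong pnt_seq;
};

static void seq_model_init(struct seq_model_rq *rq)
{
	atomic_init(&rq->pnt_seq, 0);
}

/* Bump on every pick, whichever classes prev and next belong to. */
static void seq_model_bump(struct seq_model_rq *rq)
{
	unsigned long seq = atomic_load_explicit(&rq->pnt_seq,
						 memory_order_relaxed);

	/* Release ordering plays the role of smp_store_release() here. */
	atomic_store_explicit(&rq->pnt_seq, seq + 1, memory_order_release);
}

/* Waiter side: has the CPU gone through a pick since @snapshot was taken? */
static bool seq_model_resched_seen(struct seq_model_rq *rq,
				   unsigned long snapshot)
{
	return atomic_load_explicit(&rq->pnt_seq,
				    memory_order_acquire) != snapshot;
}
```

If the bump only happens on some transitions, a waiter holding a snapshot can
keep seeing the old value even though the CPU has already rescheduled.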

> +void scx_next_task_picked(struct rq *rq, struct task_struct *prev,
> +			  struct task_struct *next)
> +{
> +	bool prev_on_scx = prev && (prev->sched_class == &ext_sched_class);

I don't think @prev or @next can ever be NULL, can they?

> +
> +	if (!scx_enabled() ||

Let's make this an inline function in ext.h. The pnt_seq update should be
moved here, after the scx_enabled() test, I think. This should probably be a
separate patch.
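
Something along these lines is what I have in mind, written as a userspace
sketch rather than a patch. All hook_model_* names are hypothetical, the
static flag stands in for scx_enabled(), and the body of
hook_model_switch_class() is my assumption about the part of switch_class()
that releases the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduling classes involved. */
enum hook_model_class { HOOK_MODEL_CLASS_RT, HOOK_MODEL_CLASS_EXT };

struct hook_model_task {
	enum hook_model_class sched_class;
};

struct hook_model_rq {
	bool cpu_released;		/* stands in for rq->scx.cpu_released */
	unsigned long pnt_seq;		/* stands in for rq->scx.pnt_seq */
	unsigned int cpu_release_calls;	/* counts modelled ops.cpu_release() */
};

/* Stand-in for scx_enabled(); a plain flag in this model. */
static bool hook_model_scx_enabled = true;

/* Assumed core of switch_class(): release the CPU once, then mark it. */
static void hook_model_switch_class(struct hook_model_rq *rq)
{
	if (!rq->cpu_released) {
		rq->cpu_release_calls++;
		rq->cpu_released = true;
	}
}

static inline void hook_model_next_task_picked(struct hook_model_rq *rq,
					const struct hook_model_task *prev,
					const struct hook_model_task *next)
{
	/* The model treats a NULL @prev or @next as a caller bug. */
	assert(prev != NULL && next != NULL);
	(void)prev;

	if (!hook_model_scx_enabled)
		return;

	/* Bumped on every transition once SCX is known to be enabled. */
	rq->pnt_seq++;

	if (next->sched_class == HOOK_MODEL_CLASS_EXT)
		return;

	hook_model_switch_class(rq);
}
```

With the RT1 -> RT2 case from the report (cpu_released left false by the
balance step), the hook both bumps the seq and sets cpu_released back to true.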

> +	    !next ||
> +	    next->sched_class == &ext_sched_class)
> +		return;
> +
> +	switch_class(rq, next, prev_on_scx);
> +}
>
>  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  			      struct task_struct *next)
>  {
> @@ -3253,7 +3267,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  		 */
>  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
>  			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> -			goto switch_class;
> +			return;
...
> @@ -2465,6 +2468,8 @@ static inline void put_prev_set_next_task(struct rq *rq,
>  
>  	__put_prev_set_next_dl_server(rq, prev, next);
>  
> +	scx_next_task_picked(rq, prev, next);

It's a bit unfortunate that we need to add this hook but I can't see another
way around it for both the problem you're reporting and the pnt_seq issue.
Maybe name it scx_put_prev_set_next(rq, prev, next) for consistency?

Thanks.

-- 
tejun