Message-ID: <4b60e303c2ac4fa0b6dc51e629427492@huawei.com>
Date: Tue, 25 Nov 2025 07:26:36 +0000
From: chenjinghuang <chenjinghuang2@...wei.com>
To: Steven Rostedt <rostedt@...dmis.org>
CC: "mingo@...hat.com" <mingo@...hat.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
	"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
	"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "bsegall@...gle.com"
	<bsegall@...gle.com>, "mgorman@...e.de" <mgorman@...e.de>,
	"vschneid@...hat.com" <vschneid@...hat.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched/rt: rto_next_cpu: Skip CPUs with NEED_RESCHED



-----Original Message-----
From: Steven Rostedt <rostedt@...dmis.org>
Sent: November 22, 2025 1:38
To: chenjinghuang <chenjinghuang2@...wei.com>
Cc: mingo@...hat.com; peterz@...radead.org; juri.lelli@...hat.com; vincent.guittot@...aro.org; dietmar.eggemann@....com; bsegall@...gle.com; mgorman@...e.de; vschneid@...hat.com; linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/rt: rto_next_cpu: Skip CPUs with NEED_RESCHED

On Fri, 21 Nov 2025 01:40:04 +0000
Chen Jinghuang <chenjinghuang2@...wei.com> wrote:

> CPU0 becomes overloaded when hosting a CPU-bound RT task, a
> non-CPU-bound RT task, and a CFS task stuck in kernel space. When
> other CPUs switch from RT to non-RT tasks, RT load balancing (LB) is
> triggered; with HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to
> drive the execution of rto_push_irq_work_func. During push_rt_task on
> CPU0, if next_task->prio < rq->donor->prio, resched_curr() sets
> NEED_RESCHED; after the push operation completes, CPU0 calls
> rto_next_cpu(). Since only CPU0 is overloaded in this scenario,
> rto_next_cpu() should ideally return -1 (no further IPI needed).
> 
> However, multiple CPUs invoking tell_cpu_to_push() during LB
> increment rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the
> mismatch between rd->rto_loop and rd->rto_loop_next forces
> rto_next_cpu() to restart its search from -1. With CPU0 remaining
> overloaded (satisfying rt_nr_migratory && rt_nr_total > 1), it gets
> reselected, causing CPU0 to queue irq_work to itself and send
> self-IPIs repeatedly. As long as CPU0 stays overloaded and other CPUs
> run pull_rt_tasks(), it falls into an infinite self-IPI loop, wasting
> CPU cycles on unnecessary interrupt handling.

Is it truly "infinite", or just wasted due to other CPUs requesting a pull?

Also, it appears the issue here is that it's sending to itself.

The IPI explosion in this scenario is caused by two combined factors: cross-CPU
IPIs triggered by other CPUs repeatedly initiating pull_rt_tasks(), and self-IPIs
sent by CPU0 after reselecting itself in rto_next_cpu(). Together they form a
chain reaction, producing a de facto infinite stream of redundant IPIs for as
long as CPU0 remains overloaded.
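
For reference, the restart logic in question, paraphrased from
rto_next_cpu() in kernel/sched/rt.c of the tree this patch is against
(comments mine):

static int rto_next_cpu(struct root_domain *rd)
{
	int next;
	int cpu;

	for (;;) {
		/* When rto_cpu is -1 this acts like cpumask_first() */
		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);

		rd->rto_cpu = cpu;

		if (cpu < nr_cpu_ids)
			return cpu;

		/* The scan finished; reset the cursor... */
		rd->rto_cpu = -1;

		/*
		 * ...but if tell_cpu_to_push() bumped rto_loop_next in
		 * the meantime, the counters no longer match and the
		 * scan restarts from -1. With CPU0 still set in
		 * rto_mask, CPU0 reselects itself here.
		 */
		next = atomic_read_acquire(&rd->rto_loop_next);
		if (rd->rto_loop == next)
			break;

		rd->rto_loop = next;
	}

	return -1;
}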

> 
> The triggering scenario is as follows:
> 
>          cpu0                     cpu1                     cpu2
>                              pull_rt_task
>                              tell_cpu_to_push
>            <---------------- irq_work_queue_on
>   rto_push_irq_work_func
>   push_rt_task
>   resched_curr(rq)                                    pull_rt_task
>   rto_next_cpu                                        tell_cpu_to_push
>            <---------------------------------------- atomic_inc(rto_loop_next)
>   rd->rto_loop != next
>   rto_next_cpu
>   irq_work_queue_on
>   rto_push_irq_work_func
> 
> Fix the redundant self-IPI/cross-CPU IPI by skipping a target CPU that
> already has a pending reschedule, which makes the IPI unnecessary.
> 
> Signed-off-by: Chen Jinghuang <chenjinghuang2@...wei.com>
> ---
>  kernel/sched/rt.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7936d4333731..29ce1af9f121 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2123,8 +2123,20 @@ static int rto_next_cpu(struct root_domain *rd)
>  
>  		rd->rto_cpu = cpu;
>  
> -		if (cpu < nr_cpu_ids)
> +		if (cpu < nr_cpu_ids) {
> +			struct task_struct *t;
> +			struct rq *rq = cpu_rq(cpu);
> +
> +			rcu_read_lock();
> +			t = rcu_dereference(rq->curr);
> +			if (test_tsk_need_resched(t)) {
> +				rcu_read_unlock();
> +				continue;
> +			}
> +			rcu_read_unlock();
> +
>  			return cpu;
> +		}
>  
>  		rd->rto_cpu = -1;
>  

Instead of skipping need resched, would skipping the current CPU work too?

Acknowledged: "sending an IPI to itself" is the direct trigger for the loop. My
original approach of checking NEED_RESCHED was an indirect optimization
that did not address the core issue.
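
For reference, the self-IPI is queued at the tail of
rto_push_irq_work_func(), which hands rto_push_work to whatever CPU
rto_next_cpu() returns without checking whether that is the local CPU
(paraphrased from kernel/sched/rt.c; comments mine):

static void rto_push_irq_work_func(struct irq_work *work)
{
	struct root_domain *rd =
		container_of(work, struct root_domain, rto_push_work);
	struct rq *rq = this_rq();
	int cpu;

	/* Push overloaded RT tasks off this runqueue first. */
	if (has_pushable_tasks(rq)) {
		raw_spin_rq_lock(rq);
		while (push_rt_task(rq, true))
			;
		raw_spin_rq_unlock(rq);
	}

	raw_spin_lock(&rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rd);

	raw_spin_unlock(&rd->rto_lock);

	if (cpu < 0) {
		sched_put_rd(rd);
		return;
	}

	/*
	 * If rto_next_cpu() restarted its scan and returned this CPU,
	 * this call becomes a self-IPI; skipping the current CPU in
	 * the scan avoids that.
	 */
	irq_work_queue_on(&rd->rto_push_work, cpu);
}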

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..cacd8912cd31 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2100,6 +2100,7 @@ static void push_rt_tasks(struct rq *rq)
  */
 static int rto_next_cpu(struct root_domain *rd)
 {
+	int this_cpu = smp_processor_id();
 	int next;
 	int cpu;
 
@@ -2118,10 +2119,13 @@ static int rto_next_cpu(struct root_domain *rd)
 	 */
 	for (;;) {
 
-		/* When rto_cpu is -1 this acts like cpumask_first() */
-		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+		do {
+			/* When rto_cpu is -1 this acts like cpumask_first() */
+			cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+			rd->rto_cpu = cpu;
 
-		rd->rto_cpu = cpu;
+			/* Do not send IPI to self */
+		} while (cpu == this_cpu);
 
 		if (cpu < nr_cpu_ids)
 			return cpu;

-- 
Steve
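
For completeness, the counter bump that keeps the scan restarting comes
from tell_cpu_to_push(): every pull attempt increments rto_loop_next,
even while an IPI chain is already in flight (paraphrased from
kernel/sched/rt.c; comments mine):

static void tell_cpu_to_push(struct rq *rq)
{
	int cpu = -1;

	/*
	 * This increment is what later makes rd->rto_loop !=
	 * rd->rto_loop_next in rto_next_cpu(), forcing the restart.
	 */
	atomic_inc(&rq->rd->rto_loop_next);

	/* Only one CPU can initiate a loop at a time */
	if (!rto_start_trylock(&rq->rd->rto_loop_start))
		return;

	raw_spin_lock(&rq->rd->rto_lock);

	/*
	 * If rto_cpu is valid, an IPI chain is still running and will
	 * keep going because of the loop_next update; otherwise start
	 * a new chain.
	 */
	if (rq->rd->rto_cpu < 0)
		cpu = rto_next_cpu(rq->rd);

	raw_spin_unlock(&rq->rd->rto_lock);

	rto_start_unlock(&rq->rd->rto_loop_start);

	if (cpu >= 0) {
		/* Make sure the rd does not get freed while pushing */
		sched_get_rd(rq->rd);
		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
	}
}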
