Message-ID: <ef64a101-c983-408e-b738-4fde57901479@amperemail.onmicrosoft.com>
Date: Thu, 17 Jul 2025 10:56:55 +0800
From: Shijie Huang <shijie@...eremail.onmicrosoft.com>
To: Valentin Schneider <vschneid@...hat.com>,
Huang Shijie <shijie@...amperecomputing.com>, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: patches@...erecomputing.com, cl@...ux.com,
Shubhang@...amperecomputing.com, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
On 2025/7/16 23:08, Valentin Schneider wrote:
> On 16/07/25 10:13, Shijie Huang wrote:
>> On 2025/7/16 1:04, Valentin Schneider wrote:
>>> On 07/07/25 16:36, Huang Shijie wrote:
>>>> When detach_tasks() scans the src_cpu's task list, the task list
>>>> may shrink during the scanning. For example, the task list
>>>> may have four tasks at the beginning, but shrink to two
>>>> during the scanning in detach_tasks():
>>>> Task list at beginning : "ABCD"
>>>> Task list in scanning : "CD"
>>>>
>>>> (ABCD stands for different tasks.)
>>>>
>>>> In this scenario, the env->loop_max is still four, so
>>>> detach_tasks() may scan some tasks twice:
>>>> the scanning order may be: "DCDC"
>>>>
>>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>>> QEMU setup.
>>>
>>> Have you seen this happen on your workloads or did you find this while
>>> staring at the code?
>> I found this issue in my Specjbb2015 test. It is very easy to trigger.
>>
> That would be good to include in the changelog.
okay.
>> I noticed it many times in the test.
>>
>> I even found that:
>>
>> At the beginning: env->loop_max is 12.
>>
>> When detach_tasks() scans, the real task list has shrunk to 5.
>>
> That's set using rq->nr_running which includes more than just the CFS
> tasks, and looking at the git history it looks like it's almost always been
> the case.
>
> Looks like env->loop_max really is only used for detach_tasks(), so perhaps
> a "better" fix would be something like the below, so that we can't iterate
> more than length(env->src_rq->cfs_tasks). That is, assuming
>
> rq->cfs.h_nr_queued == length(rq->cfs_tasks)
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..32ae24aa377ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> * still unbalanced. ld_moved simply stays zero, so it is
> * correctly treated as an imbalance.
> */
> - env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
> + env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
I tested this patch, but it did not work. I can still catch lots of
occurrences of this issue in the Specjbb test.
IMHO, the root cause of this issue is that env.loop_max is set outside
the rq lock.
Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
list can still shrink elsewhere.
Btw, the real task list can also grow bigger than env.loop_max.
For example, if we set env.loop_max to 5, the real task list may have
grown to 7 by the time detach_tasks() runs.
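
To illustrate, here is a minimal user-space sketch (not the kernel code;
the list helpers and task names are simplified stand-ins for
rq->cfs_tasks and the scan loop in detach_tasks()): loop_max is sampled
while the list still holds four tasks, two tasks then leave the list
before the scan starts, and a task that cannot be detached is rotated
back to the head, the way detach_tasks() does with list_move(). The scan
then visits D and C twice and prints "DCDC":

/*
 * Minimal user-space sketch, not kernel code: struct task, the list
 * helpers and the names below are simplified stand-ins for
 * rq->cfs_tasks and the scan loop in detach_tasks().
 */
#include <stdio.h>

struct task { char name; struct task *prev, *next; };

static struct task head = { 0, &head, &head };	/* dummy list head */

static void list_del_entry(struct task *t)
{
	t->prev->next = t->next;
	t->next->prev = t->prev;
}

static void list_add_head(struct task *t)	/* insert right after head */
{
	t->prev = &head;
	t->next = head.next;
	head.next->prev = t;
	head.next = t;
}

static void list_add_tail(struct task *t)	/* insert right before head */
{
	t->prev = head.prev;
	t->next = &head;
	head.prev->next = t;
	head.prev = t;
}

int main(void)
{
	struct task tasks[4] = { {'A'}, {'B'}, {'C'}, {'D'} };
	int loop_max, loop;

	for (int i = 0; i < 4; i++)
		list_add_tail(&tasks[i]);	/* list: A B C D */

	/* loop_max is sampled here, outside any "rq lock" */
	loop_max = 4;

	/* ... but A and B are dequeued elsewhere before the scan runs */
	list_del_entry(&tasks[0]);
	list_del_entry(&tasks[1]);		/* list is now: C D */

	for (loop = 0; loop < loop_max && head.next != &head; loop++) {
		struct task *p = head.prev;	/* take the tail, like list_last_entry() */

		printf("%c", p->name);		/* the "detach" attempt fails ... */
		list_del_entry(p);
		list_add_head(p);		/* ... so the task is rotated to the head */
	}
	printf("\n");				/* prints "DCDC" */
	return 0;
}

The same window also works in the other direction: if tasks are enqueued
after loop_max is sampled, the list ends up longer than loop_max, which
is the 5-vs-7 case above.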
Thanks
Huang Shijie
>
> more_balance:
> rq_lock_irqsave(busiest, &rf);
>