Message-ID: <d1cbc53d-d4cf-bc5a-6468-89e9a1d86f33@gentwo.org>
Date: Thu, 30 Oct 2025 09:35:23 -0700 (PDT)
From: Shubhang <sh@...two.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>, 
    shubhang@...amperecomputing.com, Ingo Molnar <mingo@...hat.com>, 
    Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
    Vincent Guittot <vincent.guittot@...aro.org>, 
    Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
    Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, 
    Shijie Huang <Shijie.Huang@...erecomputing.com>, 
    Frank Wang <zwang@...erecomputing.com>
cc: Christopher Lameter <cl@...two.org>, Adam Li <adam.li@...erecomputing.com>, 
    linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup

The system is an 80-core Ampere Altra with a two-level
sched domain topology; the MC domain contains all 80 cores.

I agree that placing the condition earlier in `select_idle_sibling()` 
aligns better with convention. I will move the check (EAS Aware) to the 
top of the function and submit a v2 patch.
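For reference, a minimal userspace model of the v2 shape, i.e. checking
prev_cpu before falling back to the idle-sibling scan. This is a sketch only:
the names `search_idle_sibling()` and the utilization threshold are
stand-ins, not the actual kernel implementation, and the real check would
live at the top of select_idle_sibling() in kernel/sched/fair.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model: per-CPU utilization vs. a fixed capacity, mirroring
 * the spirit of the PELT-based cpu_overutilized() comparison. */
#define NR_CPUS 4
static unsigned long cpu_util[NR_CPUS];
static const unsigned long capacity = 1024;

/* Overutilized if util does not fit within ~80% of capacity
 * (the fits_capacity()-style margin; illustrative, not exact). */
static bool cpu_overutilized(int cpu)
{
	return cpu_util[cpu] * 1280 > capacity * 1024;
}

/* Stand-in for the existing idle-sibling scan. */
static int search_idle_sibling(int target)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_util[cpu] == 0)
			return cpu;
	return target;
}

/* v2 ordering: prefer the cache-hot prev_cpu when it is not
 * overutilized; only otherwise scan for an idle sibling. */
static int select_idle_sibling_model(int prev_cpu)
{
	if (!cpu_overutilized(prev_cpu))
		return prev_cpu;	/* retain cache locality */
	return search_idle_sibling(prev_cpu);
}
```

With a lightly loaded prev_cpu the task stays put; once prev_cpu crosses
the margin, the fallback scan picks an idle CPU as before.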

Best,
Shubhang Kaushik

On Thu, 30 Oct 2025, Dietmar Eggemann wrote:

> On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
>> From: Shubhang Kaushik <shubhang@...amperecomputing.com>
>>
>> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
>> locality for waking tasks. The previous fast path always attempted to
>> find an idle sibling, even if the task's prev CPU was not truly busy.
>>
>> The original problem was that under some circumstances, this could lead
>> to unnecessary task migrations away from a cache-hot core, even when
>> the task's prev CPU was a suitable candidate. The scheduler's internal
>> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
>>
>> To address this, the wakeup heuristic is updated to check the status of
>> the task's `prev_cpu` first:
>> - If the `prev_cpu` is not overutilized (as determined by
>>   `cpu_overutilized()`, via PELT), the task is woken up on
>>   its previous CPU. This leverages cache locality and avoids
>>   a potentially unnecessary migration.
>> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>>   falls back to the existing behavior of searching for an idle sibling.
>
> What does your sched domain topology look like? How many CPUs do you have
> in your MC domain?
>
>>
>> Signed-off-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
>> ---
>> This patch optimizes the scheduler's wakeup path to prioritize cache
>> locality by keeping a task on its previous CPU if it is not overutilized,
>> falling back to a sibling search only when necessary.
>> ---
>>  kernel/sched/fair.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>  		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>>  	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
>>  		/* Fast path */
>> -		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>> +
>> +		/*
>> +		 * Avoid wakeup on an overutilized CPU.
>> +		 * If the previous CPU is not overloaded, retain the same for cache locality.
>> +		 * Otherwise, search for an idle sibling.
>> +		 */
>> +		if (!cpu_overutilized(prev_cpu))
>> +			new_cpu = prev_cpu;
>
> IMHO, special conditions like this one are normally coded at the
> beginning of select_idle_sibling().
>
> [...]
>
