Message-ID: <ccc99276-2f26-4cea-9f55-be7f75c8bf00@intel.com>
Date: Wed, 16 Jul 2025 23:58:33 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: kernel test robot <oliver.sang@...el.com>, Chris Mason <clm@...com>,
<oe-lkp@...ts.linux.dev>, <lkp@...el.com>, <linux-kernel@...r.kernel.org>,
<aubrey.li@...ux.intel.com>, <vincent.guittot@...aro.org>
Subject: Re: [PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle
balance fails
On 7/16/2025 7:25 PM, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 06:08:43PM +0800, Chen, Yu C wrote:
>> On 7/15/2025 3:08 PM, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 22.9% regression of unixbench.throughput on:
>>>
>>>
>>> commit: ac34cb39e8aea9915ec2f4e08c979eb2ed1d7561 ("[PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails")
>>> url: https://github.com/intel-lab-lkp/linux/commits/Chris-Mason/sched-fair-bump-sd-max_newidle_lb_cost-when-newidle-balance-fails/20250626-224805
>>> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 5bc34be478d09c4d16009e665e020ad0fcd0deea
>>> patch link: https://lore.kernel.org/all/20250626144017.1510594-2-clm@fb.com/
>>> patch subject: [PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
>>>
>>> testcase: unixbench
>>> config: x86_64-rhel-9.4
>>> compiler: gcc-12
>>> test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
>>> parameters:
>>>
>>> runtime: 300s
>>> nr_task: 100%
>>> test: shell1
>>> cpufreq_governor: performance
>>>
>>>
>> ...
>>
>>>
>>> commit:
>>> 5bc34be478 ("sched/core: Reorganize cgroup bandwidth control interface file writes")
>>> ac34cb39e8 ("sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails")
>>>
>>>   5bc34be478d09c4d ac34cb39e8aea9915ec2f4e08c9
>>>   ---------------- ---------------------------
>>>          %stddev     %change         %stddev
>>>              \          |                \
>> ...
>>
>>>      40.37           +16.9       57.24        mpstat.cpu.all.idle%
>>
>> This commit inhibits newidle balance.
>
> When not successful. So when newidle balance fails to pull tasks, it
> backs off and does less of it.
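
For readers following the report: my understanding of the patch (a rough
sketch, not necessarily the exact hunk) is that when a newidle scan of a
domain pulls nothing, the measured cost is inflated before being fed back
into the max-cost tracking, so subsequent newidle attempts on that domain
are more likely to be cut off early:

	/*
	 * Sketch only: in sched_balance_newidle(), after scanning a
	 * domain, penalize the recorded cost when nothing was pulled,
	 * so the sd->max_newidle_lb_cost cutoff fires earlier on the
	 * next attempt.
	 */
	if (!pulled_task)
		domain_cost = (domain_cost * 3) / 2;	/* 3/2 backoff */
	update_newidle_cost(sd, domain_cost);
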
>
>> It seems that some workloads do not like newidle balance, such as
>> schbench, which runs short-duration tasks, while other workloads, such
>> as the unixbench shell test case, want newidle balance to pull at its
>> best effort.
>> I wonder if we could check the sched domain's average utilization to
>> decide how hard to trigger newidle balance, or check the overutilized
>> flag to decide whether to launch it at all. Something I was thinking
>> of:
>
> Looking at the actual util signal might be interesting, but as Chris
> already noted, overutilized isn't the right thing to look at. Simply
> taking rq->cfs.avg.util_avg might be more useful. Very high util and
> failure to pull might indicate new-idle just isn't very important /
> effective, while low util and failure might mean we should try harder.
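
Something like the below is what I had in mind, as an untested sketch
(the ~80% cutoff is an arbitrary number for illustration, and the exact
placement of the backoff would follow the patch):

	/*
	 * Untested sketch: only apply the failure backoff when this rq
	 * is already mostly busy; when it is mostly idle, keep trying
	 * newidle balance at full strength.
	 */
	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
	unsigned long cap = arch_scale_cpu_capacity(cpu_of(rq));

	if (!pulled_task && util * 5 > cap * 4)	/* util above ~80% */
		domain_cost = (domain_cost * 3) / 2;
	update_newidle_cost(sd, domain_cost);
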
>
> Other things to look at:
>
> - if the sysctl_sched_migration_cost limit isn't artificially limiting
> actual scanning costs. Eg. very large domains might perhaps have
> costs that are genuinely larger than that somewhat random number.
>
> - if despite the apparent failure to pull, we do already have something
> to run (eg. wakeups).
>
> - if the 3/2 backoff is perhaps too aggressive vs the 1% per second
> decay.
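
On the last point, some quick arithmetic: update_newidle_cost() decays
sd->max_newidle_lb_cost by roughly 1% per second,

	/* from update_newidle_cost() in kernel/sched/fair.c */
	sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;

so undoing a single 3/2 bump takes about ln(1.5) / ln(256/253) ~= 34
seconds of decay, and a handful of consecutive failures could suppress
newidle balance on that domain for minutes. That does look aggressive
relative to the decay.
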
Thanks for the suggestions. Let me try to reproduce this issue locally
and figure out the proper way to address it.
thanks,
Chenyu