Message-ID: <ZIqSFjOcDvD5WZ2F@BLR-5CG11610CF.amd.com>
Date:   Thu, 15 Jun 2023 09:52:46 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched
 group during newidle balance

Hello Chen Yu,


On Tue, Jun 13, 2023 at 12:17:53AM +0800, Chen Yu wrote:
> Hi,
> 
> This is an attempt to reduce the cost of newidle balance, which was
> found to occupy noticeable CPU cycles on some high-core-count systems.
> 
> For example, when running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
> 
> 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
> 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
> 
> The main idea comes from the following question raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?
> 
> The proposal, ILB_UTIL, mainly adjusts the newidle balance scan depth
> within the current sched domain, based on the utilization of that
> domain. The more spare time there is in the domain, the more time
> each newidle balance can spend on scanning for a busy group. Although
> newidle balance already has a per-domain max_newidle_lb_cost to decide
> whether to launch the balance at all, ILB_UTIL provides a finer
> granularity: it decides how many groups each newidle balance may scan.

Thanks for the patchset. This is an interesting approach to minimise
the time spent in newidle balance when the system is relatively
busy. In the past we have seen that newly idle CPUs whose avg idle
duration is slightly higher than the max_newidle_balance_cost spend
quite a bit of time trying to find the busiest group/cpu, only to
find no task to pull. That time is an opportunity lost to go idle. I
suppose this patch is targeted towards an SMT --> MC(DIE) kind of
topology, where we have a lot of groups in the DIE domain, but it
would be interesting to see if this can help when there are fewer
groups in each sched-domain.

Will try this on our setup.
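
To make sure I have understood the mechanism, here is a rough
user-space model of the scan-depth idea (struct group,
ilb_util_scan_depth() and find_busier_group() are names I made up for
illustration, not from your patches): the spare capacity of the
domain scales how many groups a newly idle CPU scans before settling
for the busiest group seen so far.

#include <stdio.h>

struct group {
	unsigned long util;	/* aggregated utilization of the group */
	unsigned long capacity;	/* total capacity of the group */
};

/*
 * The busier the domain, the fewer groups a newly idle CPU scans:
 * with no spare capacity we scan a single group, and with a fully
 * idle domain we scan them all.
 */
static int ilb_util_scan_depth(const struct group *groups, int nr_groups)
{
	unsigned long util = 0, cap = 0;
	int i;

	for (i = 0; i < nr_groups; i++) {
		util += groups[i].util;
		cap += groups[i].capacity;
	}

	if (!cap || util >= cap)
		return 1;

	/* depth grows linearly with the spare-capacity ratio */
	return 1 + (int)((unsigned long long)(cap - util) *
			 (nr_groups - 1) / cap);
}

/* Scan at most 'depth' groups; return the busiest seen so far. */
static int find_busier_group(const struct group *groups, int nr_groups,
			     int depth)
{
	unsigned long max_util = 0;
	int i, busiest = -1;

	for (i = 0; i < nr_groups && i < depth; i++) {
		if (groups[i].util > max_util) {
			max_util = groups[i].util;
			busiest = i;
		}
	}
	return busiest;
}

int main(void)
{
	struct group g[] = {
		{ 600, 1024 }, { 900, 1024 }, { 100, 1024 }, { 1000, 1024 },
	};
	int depth = ilb_util_scan_depth(g, 4);

	printf("scan depth %d, pull from group %d\n",
	       depth, find_busier_group(g, 4, depth));
	return 0;
}

With the domain about 63% utilized, this settles for group 1 after
scanning only two of the four groups, which is the "relatively busy
group is enough" behaviour, if I have read the intent correctly.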

> 
> patch 1/4 is a code cleanup.
> 
> patch 2/4 introduces a new variable in the sched domain to indicate
>           the number of groups; it will be used by patches 3 and 4.
> 
> patch 3/4 calculates the scan depth in each periodic load balance.
> 
> patch 4/4 limits the scan depth based on the result of patch 3;
>           the depth is used by newidle_balance() ->
>           find_busiest_group() -> update_sd_lb_stats().
> 
> 
> According to the test results, netperf/tbench shows some improvement
> when the system is underloaded, while there is no noticeable
> difference for hackbench/schbench. While I am still running more
> benchmarks, including some macro-benchmarks, I am sending this draft
> patch set out to seek suggestions from the community on whether this
> is the right thing to do and whether we are heading in the right
> direction.
> 
> [We also have other wild ideas, like sorting the groups by their load
> in the periodic load balance so that a later newidle_balance() can
> fetch the busiest group in O(1). This change also seems to bring an
> improvement according to the test results.]
> 
> Any comments would be appreciated.
> 
> Chen Yu (4):
>   sched/fair: Extract the function to get the sd_llc_shared
>   sched/topology: Introduce nr_groups in sched_domain to indicate the
>     number of groups
>   sched/fair: Calculate the scan depth for idle balance based on system
>     utilization
>   sched/fair: Throttle the busiest group scanning in idle load balance
> 
>  include/linux/sched/topology.h |  5 +++
>  kernel/sched/fair.c            | 74 +++++++++++++++++++++++++++++-----
>  kernel/sched/features.h        |  1 +
>  kernel/sched/topology.c        | 10 ++++-
>  4 files changed, 79 insertions(+), 11 deletions(-)
> 
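
On the sorted-groups idea above: if the periodic load balance keeps
the groups ordered by load, a newly idle CPU can indeed take the head
of the list in O(1). A minimal user-space sketch (again, the names
periodic_balance_sort() and newidle_pick() are mine, purely for
illustration):

#include <stdio.h>
#include <stdlib.h>

struct group {
	int id;
	unsigned long load;
};

/* qsort comparator: descending by load */
static int cmp_load_desc(const void *a, const void *b)
{
	const struct group *ga = a, *gb = b;

	if (gb->load > ga->load)
		return 1;
	if (gb->load < ga->load)
		return -1;
	return 0;
}

/* The periodic balance pays the O(n log n) sorting cost... */
static void periodic_balance_sort(struct group *groups, int nr)
{
	qsort(groups, nr, sizeof(*groups), cmp_load_desc);
}

/* ...so newidle balance can fetch the busiest group in O(1). */
static const struct group *newidle_pick(const struct group *groups, int nr)
{
	return nr ? &groups[0] : NULL;
}

int main(void)
{
	struct group g[] = { { 0, 300 }, { 1, 950 }, { 2, 120 } };

	periodic_balance_sort(g, 3);
	printf("newidle pulls from group %d (load %lu)\n",
	       newidle_pick(g, 3)->id, newidle_pick(g, 3)->load);
	return 0;
}

The interesting part will be keeping such an ordering fresh enough
without adding cost to the periodic balance itself.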

--
Thanks and Regards
gautham.
