Message-ID: <ZplKZQYpqILia+aW@chenyu5-mobl2>
Date: Fri, 19 Jul 2024 01:01:25 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Peter Zijlstra <peterz@...radead.org>, Vincent Guittot
<vincent.guittot@...aro.org>, Ingo Molnar <mingo@...hat.com>, Juri Lelli
<juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Mel Gorman
<mgorman@...hsingularity.net>, Dietmar Eggemann <dietmar.eggemann@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Chen Yu
<yu.chen.surf@...il.com>, Aaron Lu <aaron.lu@...el.com>,
<linux-kernel@...r.kernel.org>, <void@...ifault.com>, Matt Fleming
<matt@...dmodwrite.com>
Subject: Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle
balance
Hi Prateek,
On 2024-07-18 at 14:58:30 +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> > On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > > Hi,
> > >
> > > This is the second version of the newidle balance optimization[1].
> > > It aims to reduce the cost of newidle balance, which was found to
> > > consume noticeable CPU cycles on some high-core-count systems.
> > >
> > > For example, when running sqlite on Intel Sapphire Rapids, which has
> > > 2 x 56C/112T = 224 CPUs:
> > >
> > > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> > >
> > > To mitigate this cost, the optimization is inspired by the question
> > > raised by Tim:
> > > Do we always have to find the busiest group and pull from it? Would
> > > a relatively busy group be enough?
> >
> > So doesn't this basically boil down to recognising that new-idle might
> > not be the same as regular load-balancing -- we need any task, fast,
> > rather than we need to make equal load.
> >
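Right, that was the intention. To make the difference concrete, here is a
minimal userspace sketch (the group loads and the "busy enough" threshold
are invented for illustration; this is of course not the real
find_busiest_group()/update_sd_lb_stats()):

/* Periodic balance wants the global maximum; newidle arguably only
 * needs the first group that is busy enough to donate a task.
 */
#include <stdio.h>

struct group { int id; unsigned long load; };

static struct group *find_busiest(struct group *g, int n)
{
	struct group *busiest = NULL;

	for (int i = 0; i < n; i++)	/* full scan, O(n) always */
		if (!busiest || g[i].load > busiest->load)
			busiest = &g[i];
	return busiest;
}

static struct group *find_busy_enough(struct group *g, int n,
				      unsigned long thresh)
{
	for (int i = 0; i < n; i++)	/* bail out at the first hit */
		if (g[i].load >= thresh)
			return &g[i];
	return NULL;
}

int main(void)
{
	struct group groups[] = { {0, 10}, {1, 90}, {2, 40}, {3, 120} };

	printf("busiest: group %d\n", find_busiest(groups, 4)->id);
	printf("busy enough: group %d\n",
	       find_busy_enough(groups, 4, 80)->id);
	return 0;
}

The first call scans all four groups and picks group 3; the second stops
at group 1, and that early bail-out is the saving this series is after.
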
> > David's shared runqueue patches did the same, they re-imagined this very
> > path.
> >
> > Now, David's thing went sideways because of some regression that wasn't
> > further investigated.
>
> In case of SHARED_RUNQ, I suspected that the frequent wakeup-sleep pattern
> of hackbench at lower utilization was raising contention somewhere, but a
> perf profile with IBS showed nothing specific and I left it there.
>
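For context, my rough mental model of SHARED_RUNQ (certainly simplified
compared to David's actual patches) is a per-LLC queue that wakeups push
into and that newly idle CPUs pop from instead of walking the sched
domains; the shared lock below is where I would expect the contention you
suspected to show up:

/* Toy model only, not David's implementation. */
#include <stdio.h>
#include <pthread.h>

#define QLEN 64

struct shared_runq {
	pthread_spinlock_t lock;	/* one lock shared by the whole LLC */
	int tasks[QLEN];
	int head, tail;
};

/* Wakeup path: no idle CPU found, park the task in the shared queue. */
static void srq_push(struct shared_runq *q, int task)
{
	pthread_spin_lock(&q->lock);
	if ((q->tail + 1) % QLEN != q->head) {
		q->tasks[q->tail] = task;
		q->tail = (q->tail + 1) % QLEN;
	}
	pthread_spin_unlock(&q->lock);
}

/* Newidle path: O(1) pop instead of a full domain scan. */
static int srq_pop(struct shared_runq *q)
{
	int task = -1;

	pthread_spin_lock(&q->lock);
	if (q->head != q->tail) {
		task = q->tasks[q->head];
		q->head = (q->head + 1) % QLEN;
	}
	pthread_spin_unlock(&q->lock);
	return task;
}

int main(void)
{
	struct shared_runq q = { .head = 0, .tail = 0 };

	pthread_spin_init(&q.lock, PTHREAD_PROCESS_PRIVATE);
	srq_push(&q, 42);
	printf("newidle CPU pulled task %d\n", srq_pop(&q));
	return 0;
}

With hackbench's wakeup rate, every wakeup and every newidle entry in the
LLC would hit that one lock, so it could get hot at low utilization
without any single instruction standing out in IBS.
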
> I revisited this again today and found this interesting data for perf
> bench sched messaging running with one group pinned to one LLC domain on
> my system:
>
> - NO_SHARED_RUNQ
>
> $ time taskset -c 0-7,128-135 ./perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 3.972 [sec] (*)
> real 0m3.985s
> user 0m6.203s (*)
> sys 1m20.087s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
> Overhead Command Shared Object Symbol
> + 51.43% sched-messaging libc.so.6 [.] read
> + 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
> + 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> 0.03% sched-messaging libc.so.6 [.] __poll
> 0.00% sched-messaging perf [.] sender
>
>
> - SHARED_RUNQ
>
> $ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 48.171 [sec] (*)
> real 0m48.186s
> user 0m5.409s (*)
> sys 0m41.185s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
> Overhead Command Shared Object Symbol
> + 47.49% sched-messaging libc.so.6 [.] read
> + 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
> + 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> + 1.08% snapd snapd [.] 0x000000000006caa3
> + 1.02% cron libc.so.6 [.] clock_nanosleep@...BC_2.2.5
> + 0.86% containerd containerd [.] runtime.futex.abi0
> + 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
>
>
> (*) With SHARED_RUNQ the total runtime balloons massively, yet both
> "user" and "sys" time go down while the "offcpu-time" event count goes
> up, i.e. the tasks spend far longer blocked.
>
> There seems to be a corner case that is not accounted for, but I'm not
> sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
> since that is what I initially tested the series on, but I can see the
> same behavior when I rebased the changes onto the current v6.10-rc5 based
> tip:sched/core.
>
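Since the off-cpu profile only shows where the tasks sleep, not why they
sleep longer, maybe perf sched could help narrow it down further (untested
on my side, and hopefully I remember the options correctly):

$ sudo perf sched record -C 0-7,128-135 -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf sched latency -s max	# longest wakeup-to-run delays per task
$ sudo perf sched timehist		# per-event wait time vs. run time

If the extra off-cpu time is scheduling delay, it should show up in the
max-latency column; if the tasks are simply not being woken, it should
show up as longer wait times in timehist instead.
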
> >
> > But it occurs to me this might be the same thing that Prateek chased
> > down here:
> >
> > https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
> >
> > Hmm ?
>
> Without the nohz_csd_func fix and the SM_IDLE fast path (Patches 1 and 2),
> the scheduler currently depends on newidle_balance() to pull tasks to an
> idle CPU. Vincent had pointed this out on the first RFC, which tried to do
> what SM_IDLE does but for the fair class alone:
>
> https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
>
> It shouldn't be too frequent, but it could be the reason why
> newidle_balance() might jump up in traces, especially if it decides to
> scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
> perhaps the PKG/NUMA in the case Chenyu was investigating initially).
>
Yes, this is my understanding too. I'll apply your patches and re-test.
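
To put rough numbers on the domain-scan cost: a full newidle pass reads
per-CPU stats for every CPU in each domain level it scans, so the widest
level dominates. The spans below are approximations for the 2 x 56C/112T
Sapphire Rapids above, not a measured topology dump:

#include <stdio.h>

int main(void)
{
	struct { const char *name; int span; } d[] = {
		{ "SMT",  2 }, { "MC", 112 }, { "NUMA", 224 },
	};
	int total = 0;

	for (int i = 0; i < 3; i++) {
		total += d[i].span;
		printf("%-4s: stats walk over %3d CPUs\n",
		       d[i].name, d[i].span);
	}
	printf("one full pass: ~%d per-CPU stat reads\n", total);
	return 0;
}

So a single unlucky newidle entry that escalates to the NUMA level costs
roughly two thirds of the whole pass, which is consistent with
update_sd_lb_stats() dominating newidle_balance() in the profile quoted at
the top of this thread.
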
thanks,
Chenyu