Message-ID: <29172279-ac5e-4860-921f-2905639dd8bf@amd.com>
Date: Thu, 18 Jul 2024 14:58:30 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, Chen Yu <yu.c.chen@...el.com>
CC: Vincent Guittot <vincent.guittot@...aro.org>, Ingo Molnar
<mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Tim Chen
<tim.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Chen Yu <yu.chen.surf@...il.com>, Aaron Lu
<aaron.lu@...el.com>, <linux-kernel@...r.kernel.org>, <void@...ifault.com>,
Matt Fleming <matt@...dmodwrite.com>
Subject: Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle
balance
Hello Peter,
On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
>> Hi,
>>
>> This is the second version of the newidle balance optimization[1].
>> It aims to reduce the cost of newidle balance which is found to
>> occupy noticeable CPU cycles on some high-core count systems.
>>
>> For example, when running sqlite on Intel Sapphire Rapids, which has
>> 2 x 56C/112T = 224 CPUs:
>>
>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>
>> To mitigate this cost, the optimization is inspired by the question
>> raised by Tim:
>> Do we always have to find the busiest group and pull from it? Would
>> a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>
> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
In the case of SHARED_RUNQ, I suspected that hackbench's frequent
wakeup-sleep pattern at lower utilization was raising contention
somewhere, but a perf profile with IBS showed nothing specific, so I
left it there.
I revisited this today and found some interesting data for perf bench
sched messaging running with one group pinned to one LLC domain on my
system:
- NO_SHARED_RUNQ
$ time ./perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 3.972 [sec] (*)
real 0m3.985s
user 0m6.203s (*)
sys 1m20.087s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
Overhead Command Shared Object Symbol
+ 51.43% sched-messaging libc.so.6 [.] read
+ 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
+ 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
0.03% sched-messaging libc.so.6 [.] __poll
0.00% sched-messaging perf [.] sender
- SHARED_RUNQ
$ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 48.171 [sec] (*)
real 0m48.186s
user 0m5.409s (*)
sys 0m41.185s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
Overhead Command Shared Object Symbol
+ 47.49% sched-messaging libc.so.6 [.] read
+ 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
+ 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
+ 1.08% snapd snapd [.] 0x000000000006caa3
+ 1.02% cron libc.so.6 [.] clock_nanosleep@...BC_2.2.5
+ 0.86% containerd containerd [.] runtime.futex.abi0
+ 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
(*) The runtime has bloated massively, but both "user" and "sys" time
are down and the "offcpu-time" count goes up with SHARED_RUNQ.
There seems to be a corner case that is not accounted for, but I'm not
sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
since that is what I initially tested the series on, but I see the
same behavior after rebasing the changes onto the current
v6.10-rc5-based tip:sched/core.
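For context on how far apart the two off-cpu profiles are, here is a
quick back-of-the-envelope conversion of the event counts quoted above,
assuming the "offcpu-time" event count is reported in nanoseconds (which
is how perf's off-cpu profiling accumulates blocked time):

```python
# Sanity check on the off-cpu event counts from the two perf reports
# above, assuming counts are in nanoseconds.
no_shared_runq_ns = 96_216_883_498       # NO_SHARED_RUNQ run
shared_runq_ns = 5_882_929_338_882       # SHARED_RUNQ run

print(no_shared_runq_ns / 1e9)           # ~96 s of total off-cpu time
print(shared_runq_ns / 1e9)              # ~5883 s of total off-cpu time
print(shared_runq_ns / no_shared_runq_ns)  # ~61x more time spent blocked
```

So with SHARED_RUNQ the threads spend roughly 61 times longer blocked
(summed across CPUs), almost all of it in read()/write(), which matches
the bloated wall-clock time despite the lower "user" and "sys" time.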
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
Without the nohz_csd_func fix and the SM_IDLE fast path (Patches 1 and
2), the scheduler currently depends on newidle_balance() to pull tasks
to an idle CPU. Vincent had pointed this out on the first RFC that
tried to tackle the problem, which did what SM_IDLE does but for the
fair class alone:
https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
It shouldn't happen too frequently, but it could be the reason why
newidle_balance() jumps up in traces, especially if it decides to
scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
perhaps PKG/NUMA in the case Chen Yu was investigating initially).
>
> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>
>
--
Thanks and Regards,
Prateek