Message-ID: <29172279-ac5e-4860-921f-2905639dd8bf@amd.com>
Date: Thu, 18 Jul 2024 14:58:30 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, Chen Yu <yu.c.chen@...el.com>
CC: Vincent Guittot <vincent.guittot@...aro.org>, Ingo Molnar
<mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Tim Chen
<tim.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Chen Yu <yu.chen.surf@...il.com>, Aaron Lu
<aaron.lu@...el.com>, <linux-kernel@...r.kernel.org>, <void@...ifault.com>,
Matt Fleming <matt@...dmodwrite.com>
Subject: Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle
balance
Hello Peter,
On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
>> Hi,
>>
>> This is the second version of the newidle balance optimization[1].
>> It aims to reduce the cost of newidle balance which is found to
>> occupy noticeable CPU cycles on some high-core count systems.
>>
>> For example, when running sqlite on Intel Sapphire Rapids, which has
>> 2 x 56C/112T = 224 CPUs:
>>
>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>
>> To mitigate this cost, the optimization is inspired by the question
>> raised by Tim:
>> Do we always have to find the busiest group and pull from it? Would
>> a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>
> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
In the case of SHARED_RUNQ, I suspected that hackbench's frequent
wakeup-sleep pattern at lower utilization was raising contention
somewhere, but a perf profile with IBS showed nothing specific, so I
left it there.
I revisited this today and found some interesting data for perf bench
sched messaging running with one group pinned to one LLC domain on my
system:
- NO_SHARED_RUNQ
$ time ./perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 3.972 [sec] (*)
real 0m3.985s
user 0m6.203s (*)
sys 1m20.087s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
Overhead Command Shared Object Symbol
+ 51.43% sched-messaging libc.so.6 [.] read
+ 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
+ 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
0.03% sched-messaging libc.so.6 [.] __poll
0.00% sched-messaging perf [.] sender
- SHARED_RUNQ
$ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 48.171 [sec] (*)
real 0m48.186s
user 0m5.409s (*)
sys 0m41.185s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
Overhead Command Shared Object Symbol
+ 47.49% sched-messaging libc.so.6 [.] read
+ 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
+ 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
+ 1.08% snapd snapd [.] 0x000000000006caa3
+ 1.02% cron libc.so.6 [.] clock_nanosleep@...BC_2.2.5
+ 0.86% containerd containerd [.] runtime.futex.abi0
+ 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
(*) The runtime has bloated massively, but both "user" and "sys" time
are down and the "offcpu-time" count goes up with SHARED_RUNQ.
There seems to be a corner case that is not accounted for, but I'm not
sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
since that is what I initially tested the series on, but I see the
same behavior after rebasing the changes onto the current
v6.10-rc5-based tip:sched/core.
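For context on how far apart the two off-cpu profiles are, here is a
quick back-of-the-envelope conversion of the event counts quoted above,
assuming the "offcpu-time" event count is reported in nanoseconds (which
is how perf's off-cpu profiling accumulates blocked time):

```python
# Sanity check on the off-cpu event counts from the two perf reports
# above, assuming counts are in nanoseconds.
no_shared_runq_ns = 96_216_883_498       # NO_SHARED_RUNQ run
shared_runq_ns = 5_882_929_338_882       # SHARED_RUNQ run

print(no_shared_runq_ns / 1e9)           # ~96 s of total off-cpu time
print(shared_runq_ns / 1e9)              # ~5883 s of total off-cpu time
print(shared_runq_ns / no_shared_runq_ns)  # ~61x more time spent blocked
```

So with SHARED_RUNQ the threads spend roughly 61 times longer blocked
(summed across CPUs), almost all of it in read()/write(), which matches
the bloated wall-clock time despite the lower "user" and "sys" time.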
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
Without the nohz_csd_func fix and the SM_IDLE fast path (Patches 1 and
2), the scheduler currently depends on newidle_balance() to pull tasks
to an idle CPU. Vincent had pointed this out on the first RFC that
tried to tackle the problem, which did what SM_IDLE does but for the
fair class alone:
https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
It shouldn't happen too frequently, but it could be the reason why
newidle_balance() jumps up in traces, especially if it decides to
scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
perhaps PKG/NUMA in the case Chen Yu was investigating initially).
>
> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>
>
--
Thanks and Regards,
Prateek