Message-ID: <20230331040609.GA184843@ziqianlu-desk2>
Date: Fri, 31 Mar 2023 12:06:09 +0800
From: Aaron Lu <aaron.lu@...el.com>
To: Daniel Jordan <daniel.m.jordan@...cle.com>
CC: Dietmar Eggemann <dietmar.eggemann@....com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
"Steven Rostedt" <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
"Valentin Schneider" <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
"Nitin Tekchandani" <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
Yu Chen <yu.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
Hi Daniel,
Thanks for taking a look.
On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > Hi Aaron,
> >
> > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > And not sure if you did the profile on different nodes? I normally chose
> > > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > > of how different node behaves and also to reduce the record size.
> > > Normally, when tg is allocated on node 0, then node 1's profile would
> > > show higher cycles for update_cfs_group() and update_load_avg().
> >
> > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > where sysbench or postgres tasks run?
>
> Oh, probably not with NCPU threads though, since the load would be
> pretty even, so I think I see where you're coming from.
Yes, I expect the load to be pretty even within the same node, so I
didn't do a full-cpu record. I used to record only a single cpu on each
node to get a fast report time, but settled on using 4 out of paranoia :-)
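
For reference, the per-node recording I mentioned is along these lines
(just a sketch; the concrete cpu ids depend on the topology, the second
range below stands in for 4 cpus on node 1):

  perf record -C 0-3 -- sleep 5
  perf report --sort=dso,symbol
  perf record -C 112-115 -- sleep 5
  perf report --sort=dso,symbol
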
>
> > > I guess your setup may have a much lower migration number?
> >
> > I also tried this and sure enough didn't see as many migrations on
> > either of two systems. I used a container with your steps with a plain
> > 6.2 kernel underneath, and the cpu controller is on (weight only). I
> > increased connections and buffer size to suit each machine, and took
> > Chen's suggestion to try without numa balancing.
I also tried disabling numa balancing per Chen's suggestion and saw
slightly fewer migrations at task wakeup time for some runs, but it
didn't make things dramatically different here.
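
In case it helps with reproducing: by disabling numa balancing I mean
the usual knob, i.e.

  sysctl kernel.numa_balancing=0
  (or: echo 0 > /proc/sys/kernel/numa_balancing)
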
> >
> > AMD EPYC 7J13 64-Core Processor
> > 2 sockets * 64 cores * 2 threads = 256 CPUs
I have a vague memory that AMD machines have a smaller LLC, with not
that many cpus belonging to the same LLC either, 8-16?
I tend to think the number of cpus per LLC plays a role here since
that's the domain where an idle cpu is searched for at task wakeup time.
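
A quick way to check the LLC span is the cache topology in sysfs
(assuming index3 is the L3 there) or the sched domain names in debugfs:

  cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
  cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
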
> >
> > sysbench: nr_threads=256
> >
> > All observability data was taken at one minute in and using one tool at
> > a time.
> >
> > @migrations[1]: 1113
> > @migrations[0]: 6152
> > @wakeups[1]: 8871744
> > @wakeups[0]: 9773321
What a nice number for migrations!
Of the ~10 million wakeups, there are only several thousand migrations,
compared to 4-5 million on my side.
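
(For anyone who wants to collect the same counters, something along the
lines of the bpftrace sketch below should do, assuming the keys above
are NUMA node ids; the actual script used may differ. It keys on the
node of the cpu where the event fires:

  bpftrace -e '
    tracepoint:sched:sched_wakeup       { @wakeups[numaid] = count(); }
    tracepoint:sched:sched_migrate_task { @migrations[numaid] = count(); }
    interval:s:60 { exit(); }
  '
)
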
> >
> > # profiled the whole system for 5 seconds, reported w/ --sort=dso,symbol
> > 0.38% update_load_avg
> > 0.13% update_cfs_group
With such a small number of migrations, the above percentages are expected.
> >
> > Using higher (nr_threads=380) and lower (nr_threads=128) load doesn't
> > change these numbers much.
> >
> > The topology of my machine is different from yours, but it's the biggest
> > I have, and I'm assuming cpu count is more important than topology when
> > reproducing the remote accesses. I also tried on
> >
> > Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> > 2 sockets * 32 cores * 2 thread = 128 CPUs
> >
> > with nr_threads=128 and got similar results.
> >
> > I'm guessing you've left all sched knobs alone? Maybe sharing those and
Yes I've left all knobs alone. The server I have access to has Ubuntu
22.04.1 installed and here are the values of these knobs:
root@...f01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
root@...f01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
idle_min_granularity_ns: 750000
latency_ns: 24000000
latency_warn_ms: 100
migration_cost_ns: 500000
min_granularity_ns: 3000000
preempt: none (voluntary) full
wakeup_granularity_ns: 4000000
> > the kconfig would help close the gap. Migrations do increase to near
> > what you were seeing when I disable SIS_UTIL (with SIS_PROP already off)
> > on the Xeon, and I see 4-5% apiece for the functions you mention when
> > profiling, but turning SIS_UTIL off is an odd thing to do.
As you can see from the above, I didn't turn off SIS_UTIL.
I've also attached my kconfig; it's basically what the distro provides,
except I had to disable some configs related to module signing or
something like that.
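
(In case anyone wants to repeat the SIS_UTIL experiment, the features
file above is writable, e.g.

  echo NO_SIS_UTIL > /sys/kernel/debug/sched/features
  echo SIS_UTIL > /sys/kernel/debug/sched/features      # re-enable
)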