Message-ID: <20230331040609.GA184843@ziqianlu-desk2>
Date: Fri, 31 Mar 2023 12:06:09 +0800
From: Aaron Lu <aaron.lu@...el.com>
To: Daniel Jordan <daniel.m.jordan@...cle.com>
CC: Dietmar Eggemann <dietmar.eggemann@....com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
"Steven Rostedt" <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
"Valentin Schneider" <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
"Nitin Tekchandani" <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
Yu Chen <yu.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
Hi Daniel,
Thanks for taking a look.
On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > Hi Aaron,
> >
> > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > And not sure if you did the profile on different nodes? I normally chose
> > > 4 cpus of each node and do 'perf record -C' with them, to get an idea
> > > of how different node behaves and also to reduce the record size.
> > > Normally, when tg is allocated on node 0, then node 1's profile would
> > > show higher cycles for update_cfs_group() and update_load_avg().
> >
> > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > where sysbench or postgres tasks run?
>
> Oh, probably not with NCPU threads though, since the load would be
> pretty even, so I think I see where you're coming from.
Yes, I expect the load to be pretty even within the same node, so I
didn't do a full-cpu record. I used to record only a single cpu on each
node to get a fast report time, but settled on using 4 out of paranoia :-)
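
For reference, the per-node recording I mentioned is along these lines
(just a sketch; the concrete cpu ids depend on the topology, the second
range below stands in for 4 cpus on node 1):

  perf record -C 0-3 -- sleep 5
  perf report --sort=dso,symbol
  perf record -C 112-115 -- sleep 5
  perf report --sort=dso,symbol
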
>
> > > I guess your setup may have a much lower migration number?
> >
> > I also tried this and sure enough didn't see as many migrations on
> > either of two systems. I used a container with your steps with a plain
> > 6.2 kernel underneath, and the cpu controller is on (weight only). I
> > increased connections and buffer size to suit each machine, and took
> > Chen's suggestion to try without numa balancing.
I also tried disabling numa balancing per Chen's suggestion and saw
slightly fewer migrations at task wakeup time for some runs, but it
didn't make things dramatically different here.
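
In case it helps with reproducing: by disabling numa balancing I mean
the usual knob, i.e.

  sysctl kernel.numa_balancing=0
  (or: echo 0 > /proc/sys/kernel/numa_balancing)
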
> >
> > AMD EPYC 7J13 64-Core Processor
> > 2 sockets * 64 cores * 2 threads = 256 CPUs
I have a vague memory that AMD machines have a smaller LLC, with not
that many cpus belonging to the same LLC either, 8-16?
I tend to think the number of cpus per LLC plays a role here since
that's the domain where an idle cpu is searched for at task wakeup time.
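
A quick way to check the LLC span is the cache topology in sysfs
(assuming index3 is the L3 there) or the sched domain names in debugfs:

  cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
  cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
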
> >
> > sysbench: nr_threads=256
> >
> > All observability data was taken at one minute in and using one tool at
> > a time.
> >
> > @migrations[1]: 1113
> > @migrations[0]: 6152
> > @wakeups[1]: 8871744
> > @wakeups[0]: 9773321
What a nice number for migrations!
Of the ~10 million wakeups, there are only several thousand migrations,
compared to 4-5 million on my side.
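
(For anyone who wants to collect the same counters, something along the
lines of the bpftrace sketch below should do, assuming the keys above
are NUMA node ids; the actual script used may differ. It keys on the
node of the cpu where the event fires:

  bpftrace -e '
    tracepoint:sched:sched_wakeup       { @wakeups[numaid] = count(); }
    tracepoint:sched:sched_migrate_task { @migrations[numaid] = count(); }
    interval:s:60 { exit(); }
  '
)
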
> >
> > # profiled the whole system for 5 seconds, reported w/ --sort=dso,symbol
> > 0.38% update_load_avg
> > 0.13% update_cfs_group
With such a small number of migrations, the above percentages are expected.
> >
> > Using higher (nr_threads=380) and lower (nr_threads=128) load doesn't
> > change these numbers much.
> >
> > The topology of my machine is different from yours, but it's the biggest
> > I have, and I'm assuming cpu count is more important than topology when
> > reproducing the remote accesses. I also tried on
> >
> > Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> > 2 sockets * 32 cores * 2 thread = 128 CPUs
> >
> > with nr_threads=128 and got similar results.
> >
> > I'm guessing you've left all sched knobs alone? Maybe sharing those and
Yes I've left all knobs alone. The server I have access to has Ubuntu
22.04.1 installed and here are the values of these knobs:
root@...f01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
root@...f01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
idle_min_granularity_ns: 750000
latency_ns: 24000000
latency_warn_ms: 100
migration_cost_ns: 500000
min_granularity_ns: 3000000
preempt: none (voluntary) full
wakeup_granularity_ns: 4000000
> > the kconfig would help close the gap. Migrations do increase to near
> > what you were seeing when I disable SIS_UTIL (with SIS_PROP already off)
> > on the Xeon, and I see 4-5% apiece for the functions you mention when
> > profiling, but turning SIS_UTIL off is an odd thing to do.
As you can see from the above, I didn't turn off SIS_UTIL.
I've also attached my kconfig; it's basically what the distro provides,
except I had to disable some configs related to module signing or
something like that.
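
(In case anyone wants to repeat the SIS_UTIL experiment, the features
file above is writable, e.g.

  echo NO_SIS_UTIL > /sys/kernel/debug/sched/features
  echo SIS_UTIL > /sys/kernel/debug/sched/features      # re-enable
)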