Message-ID: <20230405210425.5v4gl6tp54tgki65@parnassus.localdomain>
Date: Wed, 5 Apr 2023 17:04:25 -0400
From: Daniel Jordan <daniel.m.jordan@...cle.com>
To: Aaron Lu <aaron.lu@...el.com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Nitin Tekchandani <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
Yu Chen <yu.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
On Fri, Mar 31, 2023 at 12:06:09PM +0800, Aaron Lu wrote:
> Hi Daniel,
>
> Thanks for taking a look.
>
> On Thu, Mar 30, 2023 at 03:51:57PM -0400, Daniel Jordan wrote:
> > On Thu, Mar 30, 2023 at 01:46:02PM -0400, Daniel Jordan wrote:
> > > Hi Aaron,
> > >
> > > On Wed, Mar 29, 2023 at 09:54:55PM +0800, Aaron Lu wrote:
> > > > On Wed, Mar 29, 2023 at 02:36:44PM +0200, Dietmar Eggemann wrote:
> > > > > On 28/03/2023 14:56, Aaron Lu wrote:
> > > > > > On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> > > > > >> On 27/03/2023 07:39, Aaron Lu wrote:
> > > > And I'm not sure if you did the profile on different nodes? I normally
> > > > choose 4 cpus on each node and do 'perf record -C' with them, to get an
> > > > idea of how the different nodes behave and also to reduce the record size.
> > > > Normally, when the tg is allocated on node 0, node 1's profile would
> > > > show higher cycles for update_cfs_group() and update_load_avg().
> > >
> > > Wouldn't the choice of CPUs have a big effect on the data, depending on
> > > where sysbench or postgres tasks run?
> >
> > Oh, probably not with NCPU threads though, since the load would be
> > pretty even, so I think I see where you're coming from.
>
> Yes, I expect the load to be pretty even within the same node so I didn't
> do a full-cpu record. I used to record only a single cpu on each node
> to get a fast report time but settled on using 4 due to being paranoid :-)
Mhm :-) My 4-cpu profiles do look about the same as my all-system one.
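For reference, the per-node recording boils down to something like the
commands below.  The cpu lists are only illustrative (assuming cpus 0-3
sit on node 0 and 64-67 on node 1, which 'lscpu' or 'numactl -H' would
confirm on a given machine):

    # record 4 cpus on each node while the benchmark runs
    perf record -C 0-3 -g -o perf.node0.data -- sleep 10
    perf record -C 64-67 -g -o perf.node1.data -- sleep 10
    # then compare the two reports
    perf report --stdio -i perf.node0.data
    perf report --stdio -i perf.node1.data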
> I have a vague memory that the AMD machine has a smaller LLC and that the
> number of cpus belonging to the same LLC is also not large, 8-16?
Yep, 16 cpus in every one. It's a 32M LLC.
> I tend to think the number of cpus per LLC plays a role here since that's
> the domain where an idle cpu is searched for at task wakeup time.
That's true, I hadn't thought of that.
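For what it's worth, the cpu span and size of the LLC can be read from
sysfs (index3 is usually the L3/LLC; the exact index can differ):

    cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
    cat /sys/devices/system/cpu/cpu0/cache/index3/size
    lscpu | grep 'L3'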
> > > I'm guessing you've left all sched knobs alone? Maybe sharing those and
>
> Yes I've left all knobs alone. The server I have access to has Ubuntu
> 22.04.1 installed and here are the values of these knobs:
> root@...f01924c30:/sys/kernel/debug/sched# sysctl -a |grep sched
> kernel.sched_autogroup_enabled = 1
> kernel.sched_cfs_bandwidth_slice_us = 5000
> kernel.sched_child_runs_first = 0
> kernel.sched_deadline_period_max_us = 4194304
> kernel.sched_deadline_period_min_us = 100
> kernel.sched_energy_aware = 1
> kernel.sched_rr_timeslice_ms = 100
> kernel.sched_rt_period_us = 1000000
> kernel.sched_rt_runtime_us = 950000
> kernel.sched_schedstats = 0
> kernel.sched_util_clamp_max = 1024
> kernel.sched_util_clamp_min = 1024
> kernel.sched_util_clamp_min_rt_default = 1024
>
> root@...f01924c30:/sys/kernel/debug/sched# for i in `ls features *_ns *_ms preempt`; do echo "$i: `cat $i`"; done
> features: GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD BASE_SLICE
> idle_min_granularity_ns: 750000
> latency_ns: 24000000
> latency_warn_ms: 100
> migration_cost_ns: 500000
> min_granularity_ns: 3000000
> preempt: none (voluntary) full
> wakeup_granularity_ns: 4000000
Right, figures, all the same on my machines.
> And I've attached the kconfig; it's basically what the distro provided
> except I had to disable some configs related to module signing or
> something like that.
Thanks for all the info. I got the same low perf percentages using your
kconfig as I got before (<0.50% for both functions), so maybe this just
takes a big machine with big LLCs, which sadly I haven't got.
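In case it helps to compare numbers, I read the two functions' overhead
with something like this (the data file name is just an example):

    perf report --no-children --stdio -i perf.data | \
        grep -E 'update_cfs_group|update_load_avg'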