Message-ID: <20230405213117.jx2t5z3liowbr5su@parnassus.localdomain>
Date:   Wed, 5 Apr 2023 17:31:17 -0400
From:   Daniel Jordan <daniel.m.jordan@...cle.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Yu Chen <yu.c.chen@...el.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Nitin Tekchandani <nitin.tekchandani@...el.com>,
        Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Tue, Apr 04, 2023 at 11:15:40PM +0800, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> [...]
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
> 
> Regarding this observation, I've some finding. The TLDR is: 1 instance
> setup's overall CPU util is lower than N >= 2 instances setup and as a
> result, under 1 instance setup, sis() is more likely to find idle cpus
> than N >= 2 instances setup and that is the reason why 1 instance setup
> has more migrations.
> 
> More details:
> 
> For 1 instance with nr_thread=nr_cpu=224 setup, during a 5s window,
> there are 10 million calls of select_idle_sibling() and 6.1 million
> migrations. Of these migrations, 4.6 million comes from select_idle_cpu(),
> 1.3 million comes from recent_cpu.
> mpstat of this time window:
> Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     all   45.15    0.00   18.59    0.00    0.00   17.29    0.00    0.00    0.00   18.98
> Average:       0   38.14    0.00   17.29    0.00    0.00   14.77    0.00    0.00    0.00   29.80
> Average:       1   52.07    0.00   19.88    0.00    0.00   19.78    0.00    0.00    0.00    8.28

Aha.  It takes one instance of nr_thread=(3/4)*nr_cpu to get this
overall utilization on my aforementioned Xeon, but then I see 3-4% on
both functions in the profile.  I'll poke at it some more and see how
badly it hurts across more loads; it might take a bit, though.

> For 4 instance with nr_thread=56 setup, during a 5s window, there are 15
> million calls of select_idle_sibling() and only 30k migrations.
> select_idle_cpu() is called 15 million times but only 23k of them passed
> the sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     all   68.54    0.00   21.54    0.00    0.00    8.35    0.00    0.00    0.00    1.58
> Average:       0   70.05    0.00   20.92    0.00    0.00    8.17    0.00    0.00    0.00    0.87
> Average:       1   67.03    0.00   22.16    0.00    0.00    8.53    0.00    0.00    0.00    2.29
> 
> For 8 instance with nr_thread=28 setup, during a 5s window, there are
> 16 million calls of select_idle_sibling() and 9.6k migrations.
> select_idle_cpu() is called 16 million times but none of them passed the
> sd_share->nr_idle_scan != 0 test.
> mpstat of this time window:
> Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     all   70.29    0.00   20.99    0.00    0.00    8.28    0.00    0.00    0.00    0.43
> Average:       0   71.58    0.00   19.98    0.00    0.00    8.04    0.00    0.00    0.00    0.40
> Average:       1   69.00    0.00   22.01    0.00    0.00    8.52    0.00    0.00    0.00    0.47
> 
> On a side note: when sd_share->nr_idle_scan > 0 and has_idle_core is true,
> then sd_share->nr_idle_scan is not actually respected. Is this intended?
> It seems to say: if there is idle core, then let's try hard and ignore
> SIS_UTIL to find that idle core, right?
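
To make the side note concrete, here is a minimal user-space sketch of
the scan behaviour as the question above describes it.  It is a toy
model, not the kernel's select_idle_cpu(): struct llc_model,
fully_idle_core() and model_select_idle_cpu() are hypothetical names
made up for illustration, and the "SMT siblings are adjacent CPU
numbers" layout is an assumption.  It only shows that a scan budget
which is decremented in the non-idle-core branch alone does not bound
the scan when has_idle_core is true.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_CPUS 64

/* Hypothetical stand-in for one LLC domain; not a kernel structure. */
struct llc_model {
	int  nr_cpus;               /* CPUs in the LLC (assumed: siblings adjacent) */
	int  smt_width;             /* hardware threads per core */
	int  nr_idle_scan;          /* scan budget, as SIS_UTIL would provide it */
	bool idle[MODEL_MAX_CPUS];  /* idle[i] is true when CPU i is idle */
};

/* Return the first CPU of cpu's core if every thread of it is idle, else -1. */
static int fully_idle_core(const struct llc_model *m, int cpu)
{
	int first = cpu - cpu % m->smt_width;

	for (int t = 0; t < m->smt_width; t++)
		if (!m->idle[first + t])
			return -1;
	return first;
}

/*
 * Toy scan starting after 'target'.  Returns the chosen CPU or -1.
 * '*visited' reports how many CPUs were actually examined.
 */
static int model_select_idle_cpu(const struct llc_model *m, int target,
				 bool has_idle_core, int *visited)
{
	int nr = m->nr_idle_scan + 1;  /* +1 because the loop stops on !--nr */
	int fallback = -1;

	assert(m->nr_cpus > 0 && m->nr_cpus <= MODEL_MAX_CPUS);
	assert(m->smt_width > 0 && m->nr_cpus % m->smt_width == 0);

	*visited = 0;
	if (nr == 1)
		return -1;  /* nr_idle_scan == 0: skip the scan entirely */

	for (int k = 1; k <= m->nr_cpus; k++) {
		int cpu = (target + k) % m->nr_cpus;

		if (has_idle_core) {
			/* Budget is never decremented on this path. */
			int core = fully_idle_core(m, cpu);

			(*visited)++;
			if (core >= 0)
				return core;
			if (fallback < 0 && m->idle[cpu])
				fallback = cpu;  /* remember an idle thread */
		} else {
			if (!--nr)
				return -1;  /* budget exhausted */
			(*visited)++;
			if (m->idle[cpu])
				return cpu;
		}
	}
	return fallback;
}
```

With 8 CPUs, 2 threads per core, nr_idle_scan = 1 and only CPUs 6 and 7
idle, the model examines 6 CPUs and returns CPU 6 when has_idle_core is
true, but examines a single CPU and gives up when it is false.  With
nr_idle_scan = 0 neither path scans at all, which matches the "passed
the sd_share->nr_idle_scan != 0 test" counts quoted above.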