linux-kernel - Re: [RFC PATCH] sched/fair: Make tg->load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230504102746.GA137823@ziqianlu-desk2>
Date:   Thu, 4 May 2023 18:27:46 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Daniel Jordan <daniel.m.jordan@...cle.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        "Steven Rostedt" <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        "Valentin Schneider" <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        "Nitin Tekchandani" <nitin.tekchandani@...el.com>,
        Waiman Long <longman@...hat.com>,
        Yu Chen <yu.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

Hi Daniel,

Thanks a lot for collecting these data.

I had hoped to also share some data I collected on other machines after
seeing your last email but trying to explain why only SPR showed benefit
has slowed me down. I now has some finding on this, please see below.

On Wed, May 03, 2023 at 03:41:25PM -0400, Daniel Jordan wrote:
> On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> > On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > > AMD EPYC 7J13 64-Core Processor (NPS1)
> > >     2 sockets * 64 cores * 2 threads = 256 CPUs
> > >
> > >                       update_load_avg profile%    update_cfs_group profile%
> > > affinity  nr_threads          base  test  diff             base  test  diff
> > >  unbound          96           0.7   0.6  -0.1              0.3   0.6   0.4
> > >  unbound         128           0.8   0.7   0.0              0.3   0.7   0.4
> > >  unbound         160           2.4   1.7  -0.7              1.2   2.3   1.1
> > >  unbound         192           2.3   1.7  -0.6              0.9   2.4   1.5
> > >  unbound         224           0.9   0.9   0.0              0.3   0.6   0.3
> > >  unbound         256           0.4   0.4   0.0              0.1   0.2   0.1
> >
> > Is it possible to show per-node profile for the two functions? I wonder
> > how the per-node profile changes with and without this patch on Milan.
> > And for vanilla kernel, it would be good to know on which node the struct
> > task_group is allocated. I used below script to fetch this info:
> > kretfunc:sched_create_group
> > {
> >         $root = kaddr("root_task_group");
> > 	if (args->parent == $root) {
> > 		return;
> > 	}
> >
> > 	printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
> > 			retval, str(args->parent->css.cgroup->kn->name));
> > }
> 
> That's helpful, nid below comes from this.  The node happened to be different
> between base and test kernels on both machines, so that's one less way the
> experiment is controlled but for the unbound case where tasks are presumably
> spread fairly evenly I'm not sure how much it matters, especially given that
> the per-node profile numbers are fairly close to each other.
> 
> 
> Data below, same parameters and times as the last mail.
> 
> > BTW, is the score(transactions) of the workload stable? If so, how the
> > score change when the patch is applied?
> 
> Transactions seem to be mostly stable but unfortunately regress overall on both
> machines.

Yeah, I noticed your result is pretty stable in that the stddev% is
mostly zero. Mine are not that stable though. And it looks like there
are some wins in the node1 case :)

> FWIW, t-test compares the two sets of ten iterations apiece.  The higher the
> percentage, the higher the confidence that the difference is significant.
> 
> 
> AMD EPYC 7J13 64-Core Processor (NPS1)
>     2 sockets * 64 cores * 2 threads = 256 CPUs
> 
> transactions per second
> 
>                                    diff                 base                test
>                       -----------------   ------------------  ------------------
>                           tps       tps
> affinity  nr_threads  (%diff)  (t-test)       tps  std%  nid      tps  std%  nid
>  unbound          96    -0.8%      100%   128,450    0%    1  127,433    0%    0
>  unbound         128    -1.0%      100%   138,471    0%    1  137,099    0%    0
>  unbound         160    -1.2%      100%   136,829    0%    1  135,170    0%    0
>  unbound         192     0.4%       95%   152,767    0%    1  153,336    0%    0
>  unbound         224    -0.2%       81%   179,946    0%    1  179,620    0%    0
>  unbound         256    -0.2%       71%   203,920    0%    1  203,583    0%    0
>    node0          48     0.1%       46%    69,635    0%    0   69,719    0%    0
>    node0          64    -0.1%       69%    75,213    0%    0   75,163    0%    0
>    node0          80    -0.4%      100%    72,520    0%    0   72,217    0%    0
>    node0          96    -0.2%       89%    81,345    0%    0   81,210    0%    0
>    node0         112    -0.3%       98%    96,174    0%    0   95,855    0%    0
>    node0         128    -0.7%      100%   111,813    0%    0  111,045    0%    0
>    node1          48     0.3%       78%    69,985    1%    1   70,200    1%    1
>    node1          64     0.6%      100%    75,770    0%    1   76,231    0%    1
>    node1          80     0.3%      100%    73,329    0%    1   73,567    0%    1
>    node1          96     0.4%       99%    82,222    0%    1   82,556    0%    1
>    node1         112     0.1%       62%    96,573    0%    1   96,689    0%    1
>    node1         128    -0.2%       69%   111,614    0%    1  111,435    0%    1
>
> update_load_avg profile%
> 
>                                all_nodes             node0             node1
> 		        ----------------  ----------------  ----------------
> affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
>  unbound          96     0.7   0.6  -0.1   0.7   0.6  -0.1   0.7   0.6  -0.1
>  unbound         128     0.8   0.7  -0.1   0.8   0.7  -0.1   0.8   0.7  -0.1
>  unbound         160     2.3   1.7  -0.7   2.5   1.7  -0.8   2.2   1.6  -0.5
>  unbound         192     2.2   1.6  -0.6   2.5   1.8  -0.7   2.0   1.4  -0.6
>  unbound         224     0.9   0.8  -0.1   1.1   0.7  -0.3   0.8   0.8   0.0
>  unbound         256     0.4   0.4   0.0   0.4   0.4   0.0   0.4   0.4   0.0
>    node0          48     0.7   0.6  -0.1
>    node0          64     0.8   0.7  -0.2
>    node0          80     2.0   1.4  -0.7
>    node0          96     2.3   1.4  -0.9
>    node0         112     1.0   0.8  -0.2
>    node0         128     0.5   0.4   0.0
>    node1          48     0.7   0.6  -0.1
>    node1          64     0.8   0.6  -0.1
>    node1          80     1.4   1.2  -0.2
>    node1          96     1.5   1.3  -0.2
>    node1         112     0.8   0.7  -0.1
>    node1         128     0.4   0.4  -0.1
> 
> update_cfs_group profile%
> 
>                                all_nodes             node0             node1
> 		        ----------------  ----------------  ----------------
> affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
>  unbound          96     0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.6   0.3
>  unbound         128     0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.7   0.4
>  unbound         160     1.1   2.5   1.4   1.3   2.2   0.9   0.9   2.8   1.9
>  unbound         192     0.9   2.6   1.7   1.1   2.4   1.3   0.7   2.8   2.1
>  unbound         224     0.3   0.8   0.5   0.4   0.6   0.3   0.2   0.9   0.6
>  unbound         256     0.1   0.2   0.1   0.1   0.2   0.1   0.1   0.2   0.1
>    node0          48     0.4   0.6   0.2
>    node0          64     0.3   0.6   0.3
>    node0          80     0.7   0.6  -0.1
>    node0          96     0.6   0.6   0.0
>    node0         112     0.3   0.4   0.1
>    node0         128     0.1   0.2   0.1
>    node1          48     0.3   0.6   0.3
>    node1          64     0.3   0.6   0.3
>    node1          80     0.3   0.6   0.3
>    node1          96     0.3   0.6   0.3
>    node1         112     0.2   0.3   0.2
>    node1         128     0.1   0.2   0.1
>

update_load_avg()'s cost dropped while update_cfs_group()'s cost
increased. I think this is reasonable since the write side only has to
deal with local data now while the read side has to iterate the per-node
tg->load_avg on all nodes.

> 
> Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
>     2 sockets * 32 cores * 2 thread = 128 CPUs
> 
> transactions per second
> 
>                                    diff                 base                test
>                       -----------------   ------------------  ------------------
>                           tps       tps
> affinity  nr_threads  (%diff)  (t-test)       tps  std%  nid      tps  std%  nid
>  unbound          48    -0.9%      100%    75,500    0%    1   74,834    0%    0
>  unbound          64    -0.4%      100%    81,687    0%    1   81,368    0%    0
>  unbound          80    -0.4%      100%    78,620    0%    1   78,281    0%    0
>  unbound          96    -0.5%       74%    78,949    1%    1   78,580    1%    0
>  unbound         112    -2.9%       87%    94,189    3%    1   91,458    5%    0
>  unbound         128    -1.4%      100%   117,557    0%    1  115,921    0%    0
>    node0          24    -0.7%      100%    38,601    0%    0   38,333    0%    0
>    node0          32    -1.2%      100%    41,539    0%    0   41,038    0%    0
>    node0          40    -1.6%      100%    42,325    0%    0   41,662    0%    0
>    node0          48    -1.3%      100%    41,956    0%    0   41,404    0%    0
>    node0          56    -1.3%      100%    42,115    0%    0   41,569    0%    0
>    node0          64    -1.0%      100%    62,431    0%    0   61,784    0%    0
>    node1          24     0.0%        1%    38,752    0%    1   38,752    0%    1
>    node1          32     0.9%      100%    42,568    0%    1   42,943    0%    1
>    node1          40    -0.2%       87%    43,452    0%    1   43,358    0%    1
>    node1          48    -0.5%      100%    43,047    0%    1   42,831    0%    1
>    node1          56    -0.5%      100%    43,464    0%    1   43,259    0%    1
>    node1          64     0.5%      100%    64,111    0%    1   64,450    0%    1

This looks like mostly a loss for Icelake.

I also tested on the same Icelake 8358 and my result is not entirely
the same to yours:

nr_thread=128
                 score(tps)     update_cfs_group%   update_load_avg%
 6.2.0           97418±0.17%      0.50% - 0.74%       0.69% - 0.93%
this_patch       97029±0.32%      0.68% - 0.89%       0.70% - 0.89%

For the above nr_thread=128 unbound case, the score(tps) is in noise
range, instead of a 1.4% loss in your run. Profile wise, the write
side's cost slightly dropped while the read side's cost slightly
increased. Overall, no big change for nr_thread=128 on this Icelake.
I think this is also expected since in nr_thread=128 case, there are
very few migrations on wake up due to cpu utilization is almost 100% so
this patch shouldn't make an obvious difference.

nr_thread=96
                  score(tps)    update_cfs_group%  update_load_avg%
 6.2.0           59183±0.21%      2.81% - 3.57%       3.48% - 3.76%
this_patch       58397±0.35%      2.70% - 3.01%       2.82% - 3.24%

For this case, there are enough task migrations on wakeup for this patch
to make a difference: the tps dropped about 1.3%, worse than your run.
Profile wise, both write side and read side dropped. But these drops do
not translate to performance gains. I think from the profile, this patch
is doing something good, it's just the tps suggested otherwise.

On another 2S, 224 cpu Sapphire Rapids:

nr_thread=224
                 score          update_cfs_group%   update_load_avg%
6.2.0           93504±4.79%      11.63% - 15.12%      7.00% - 10.31%
this_patch     103040±0.46%       7.08% -  9.08%      4.82% -  6.73%

The above is where this patch helps the most, both profile and score
shows improvement. My finding about why only SPR shows benefit is, I
think this has something to do with SPR's "Ingress Queue overflow" when
many cpus access the same cache line and once that overflow happened,
all those accessing cpus will have their memory operations slowed down.
This is described in section 3.11 of Intel's optimization reference
manual.

To confirm the above explanation, I did the nr_thread=96 run on SPR. To
make sure task migration still happens in this case, some cpus are
offlined and only 128 cpus are left. With fewer threads, the chance of
ingress queue overflow is much lower:

nr_thread=96 with cpu offlined to 128c left (32cores/64cpus on each socket)

                 score          update_cfs_group%          update_load_avg%
6.2.0           74878±0.58%       3.47% - 4.90%               3.59% - 4.42%
this_patch      75671±0.24%       2.66% - 3.55%               2.71% - 3.44%

Profile wise, the two functions dropped to near Icelake level, still
higher than Icelake but much better than nr_thread=224 case. When
comparing base line with this patch, it follows what I saw on Icelake:
the two functions' cost dropped but that did not translate to
performance increase(average is slightly higer but it's in noise range).

Base on my current understanding, the summary is:
- Running this workload with nr_thread=224 on SPR, the ingress queue
  will overflow and that will slow things down. This patch helps
  performance mainly because it transform the "many cpus accessing the
  same cacheline" scenario to "many cpus accessing two cachelines" and
  that can reduce the likelyhood of ingress queue overflow and thus,
  helps performance;
- On Icelake with high nr_threads but not too high that would cause
  100% cpu utilization, the two functions' cost will drop a little but
  performance did not improve(it actually regressed a little);
- On SPR when there is no ingress queue overflow, it's similar to
  Icelake: the two functions' cost will drop but performance did not
  improve.