Date:   Wed, 3 May 2023 15:41:25 -0400
From:   Daniel Jordan <daniel.m.jordan@...cle.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Nitin Tekchandani <nitin.tekchandani@...el.com>,
        Waiman Long <longman@...hat.com>,
        Yu Chen <yu.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > AMD EPYC 7J13 64-Core Processor (NPS1)
> >     2 sockets * 64 cores * 2 threads = 256 CPUs
> >
> >                       update_load_avg profile%    update_cfs_group profile%
> > affinity  nr_threads          base  test  diff             base  test  diff
> >  unbound          96           0.7   0.6  -0.1              0.3   0.6   0.4
> >  unbound         128           0.8   0.7   0.0              0.3   0.7   0.4
> >  unbound         160           2.4   1.7  -0.7              1.2   2.3   1.1
> >  unbound         192           2.3   1.7  -0.6              0.9   2.4   1.5
> >  unbound         224           0.9   0.9   0.0              0.3   0.6   0.3
> >  unbound         256           0.4   0.4   0.0              0.1   0.2   0.1
>
> Is it possible to show per-node profile for the two functions? I wonder
> how the per-node profile changes with and without this patch on Milan.
> And for vanilla kernel, it would be good to know on which node the struct
> task_group is allocated. I used below script to fetch this info:
> kretfunc:sched_create_group
> {
>         $root = kaddr("root_task_group");
> 	if (args->parent == $root) {
> 		return;
> 	}
>
> 	printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
> 			retval, str(args->parent->css.cgroup->kn->name));
> }

That's helpful, nid below comes from this.  The node happened to be different
between base and test kernels on both machines, so that's one less way the
experiment is controlled.  For the unbound case, though, where tasks are
presumably spread fairly evenly, I'm not sure how much it matters, especially
given that the per-node profile numbers are fairly close to each other.


Data below, same parameters and times as the last mail.

> BTW, is the score(transactions) of the workload stable? If so, how the
> score change when the patch is applied?

Transactions seem to be mostly stable but unfortunately regress overall on both
machines.

FWIW, the t-test column compares the two sets of ten iterations apiece (base
vs test).  The higher the percentage, the higher the confidence that the
difference is significant.
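
For anyone wanting to reproduce the diff columns, here is a rough sketch of the arithmetic.  It assumes %diff is (test - base) / base on the mean tps, which matches the rows below, and it uses Welch's t-test as a stand-in, since this mail doesn't say which t-test variant was used or how the confidence percentage is derived from it.  The function names are made up for illustration.

```python
import math
import statistics

def pct_diff(base_mean, test_mean):
    """Relative change of test vs. base, in percent."""
    return (test_mean - base_mean) / base_mean * 100.0

def welch_t(base, test):
    """Welch's t statistic and approximate degrees of freedom.

    Assumption: Welch's variant; the mail does not name the test used.
    """
    vb = statistics.variance(base) / len(base)   # squared std error, base
    vt = statistics.variance(test) / len(test)   # squared std error, test
    t = (statistics.mean(test) - statistics.mean(base)) / math.sqrt(vb + vt)
    # Welch-Satterthwaite approximation
    df = (vb + vt) ** 2 / (vb ** 2 / (len(base) - 1) + vt ** 2 / (len(test) - 1))
    return t, df

# Mean tps from the first unbound row of the AMD table (96 threads).
print(f"{pct_diff(128450, 127433):.1f}%")   # prints -0.8%
```

Turning t and df into a confidence figure needs the t distribution's CDF, which is left out here; the per-iteration samples are also not included in this mail, so welch_t() would have to be fed from the raw run data.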


AMD EPYC 7J13 64-Core Processor (NPS1)
    2 sockets * 64 cores * 2 threads = 256 CPUs

transactions per second

                                   diff                 base                test
                      -----------------   ------------------  ------------------
                          tps       tps
affinity  nr_threads  (%diff)  (t-test)       tps  std%  nid      tps  std%  nid
 unbound          96    -0.8%      100%   128,450    0%    1  127,433    0%    0
 unbound         128    -1.0%      100%   138,471    0%    1  137,099    0%    0
 unbound         160    -1.2%      100%   136,829    0%    1  135,170    0%    0
 unbound         192     0.4%       95%   152,767    0%    1  153,336    0%    0
 unbound         224    -0.2%       81%   179,946    0%    1  179,620    0%    0
 unbound         256    -0.2%       71%   203,920    0%    1  203,583    0%    0
   node0          48     0.1%       46%    69,635    0%    0   69,719    0%    0
   node0          64    -0.1%       69%    75,213    0%    0   75,163    0%    0
   node0          80    -0.4%      100%    72,520    0%    0   72,217    0%    0
   node0          96    -0.2%       89%    81,345    0%    0   81,210    0%    0
   node0         112    -0.3%       98%    96,174    0%    0   95,855    0%    0
   node0         128    -0.7%      100%   111,813    0%    0  111,045    0%    0
   node1          48     0.3%       78%    69,985    1%    1   70,200    1%    1
   node1          64     0.6%      100%    75,770    0%    1   76,231    0%    1
   node1          80     0.3%      100%    73,329    0%    1   73,567    0%    1
   node1          96     0.4%       99%    82,222    0%    1   82,556    0%    1
   node1         112     0.1%       62%    96,573    0%    1   96,689    0%    1
   node1         128    -0.2%       69%   111,614    0%    1  111,435    0%    1

update_load_avg profile%

                               all_nodes             node0             node1
                        ----------------  ----------------  ----------------
affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
 unbound          96     0.7   0.6  -0.1   0.7   0.6  -0.1   0.7   0.6  -0.1
 unbound         128     0.8   0.7  -0.1   0.8   0.7  -0.1   0.8   0.7  -0.1
 unbound         160     2.3   1.7  -0.7   2.5   1.7  -0.8   2.2   1.6  -0.5
 unbound         192     2.2   1.6  -0.6   2.5   1.8  -0.7   2.0   1.4  -0.6
 unbound         224     0.9   0.8  -0.1   1.1   0.7  -0.3   0.8   0.8   0.0
 unbound         256     0.4   0.4   0.0   0.4   0.4   0.0   0.4   0.4   0.0
   node0          48     0.7   0.6  -0.1
   node0          64     0.8   0.7  -0.2
   node0          80     2.0   1.4  -0.7
   node0          96     2.3   1.4  -0.9
   node0         112     1.0   0.8  -0.2
   node0         128     0.5   0.4   0.0
   node1          48     0.7   0.6  -0.1
   node1          64     0.8   0.6  -0.1
   node1          80     1.4   1.2  -0.2
   node1          96     1.5   1.3  -0.2
   node1         112     0.8   0.7  -0.1
   node1         128     0.4   0.4  -0.1

update_cfs_group profile%

                               all_nodes             node0             node1
                        ----------------  ----------------  ----------------
affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
 unbound          96     0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.6   0.3
 unbound         128     0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.7   0.4
 unbound         160     1.1   2.5   1.4   1.3   2.2   0.9   0.9   2.8   1.9
 unbound         192     0.9   2.6   1.7   1.1   2.4   1.3   0.7   2.8   2.1
 unbound         224     0.3   0.8   0.5   0.4   0.6   0.3   0.2   0.9   0.6
 unbound         256     0.1   0.2   0.1   0.1   0.2   0.1   0.1   0.2   0.1
   node0          48     0.4   0.6   0.2
   node0          64     0.3   0.6   0.3
   node0          80     0.7   0.6  -0.1
   node0          96     0.6   0.6   0.0
   node0         112     0.3   0.4   0.1
   node0         128     0.1   0.2   0.1
   node1          48     0.3   0.6   0.3
   node1          64     0.3   0.6   0.3
   node1          80     0.3   0.6   0.3
   node1          96     0.3   0.6   0.3
   node1         112     0.2   0.3   0.2
   node1         128     0.1   0.2   0.1


Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
    2 sockets * 32 cores * 2 threads = 128 CPUs

transactions per second

                                   diff                 base                test
                      -----------------   ------------------  ------------------
                          tps       tps
affinity  nr_threads  (%diff)  (t-test)       tps  std%  nid      tps  std%  nid
 unbound          48    -0.9%      100%    75,500    0%    1   74,834    0%    0
 unbound          64    -0.4%      100%    81,687    0%    1   81,368    0%    0
 unbound          80    -0.4%      100%    78,620    0%    1   78,281    0%    0
 unbound          96    -0.5%       74%    78,949    1%    1   78,580    1%    0
 unbound         112    -2.9%       87%    94,189    3%    1   91,458    5%    0
 unbound         128    -1.4%      100%   117,557    0%    1  115,921    0%    0
   node0          24    -0.7%      100%    38,601    0%    0   38,333    0%    0
   node0          32    -1.2%      100%    41,539    0%    0   41,038    0%    0
   node0          40    -1.6%      100%    42,325    0%    0   41,662    0%    0
   node0          48    -1.3%      100%    41,956    0%    0   41,404    0%    0
   node0          56    -1.3%      100%    42,115    0%    0   41,569    0%    0
   node0          64    -1.0%      100%    62,431    0%    0   61,784    0%    0
   node1          24     0.0%        1%    38,752    0%    1   38,752    0%    1
   node1          32     0.9%      100%    42,568    0%    1   42,943    0%    1
   node1          40    -0.2%       87%    43,452    0%    1   43,358    0%    1
   node1          48    -0.5%      100%    43,047    0%    1   42,831    0%    1
   node1          56    -0.5%      100%    43,464    0%    1   43,259    0%    1
   node1          64     0.5%      100%    64,111    0%    1   64,450    0%    1

update_load_avg profile%

                               all_nodes             node0             node1
                        ----------------  ----------------  ----------------
affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
 unbound          48     0.5   0.5   0.0   0.5   0.5   0.0   0.4   0.5   0.0
 unbound          64     0.5   0.5   0.0   0.5   0.5   0.0   0.5   0.5   0.0
 unbound          80     2.0   1.8  -0.3   2.0   1.7  -0.3   2.0   1.8  -0.2
 unbound          96     3.4   2.8  -0.6   3.4   2.8  -0.6   3.4   2.9  -0.5
 unbound         112     2.5   2.3  -0.1   4.5   3.8  -0.8   0.5   0.9   0.5
 unbound         128     0.4   0.5   0.0   0.4   0.4   0.0   0.5   0.5   0.1
   node0          24     0.4   0.5   0.0
   node0          32     0.5   0.5   0.0
   node0          40     1.0   1.1   0.1
   node0          48     1.5   1.6   0.1
   node0          56     1.8   1.9   0.1
   node0          64     0.4   0.4   0.0
   node1          24     0.5   0.4   0.0
   node1          32     0.5   0.4   0.0
   node1          40     1.0   1.1   0.0
   node1          48     1.6   1.6   0.1
   node1          56     1.9   1.9   0.0
   node1          64     0.4   0.4  -0.1


update_cfs_group profile%

                               all_nodes             node0             node1
                        ----------------  ----------------  ----------------
affinity  nr_threads    base  test  diff  base  test  diff  base  test  diff
 unbound          48     0.3   0.5   0.2   0.3   0.5   0.2   0.3   0.5   0.2
 unbound          64     0.5   0.6   0.1   0.5   0.6   0.1   0.5   0.6   0.1
 unbound          80     2.8   2.5  -0.3   2.6   2.4  -0.2   2.9   2.5  -0.5
 unbound          96     3.7   3.3  -0.4   3.5   3.3  -0.2   3.9   3.3  -0.6
 unbound         112     4.2   3.2  -1.0   4.1   3.3  -0.7   4.4   3.1  -1.2
 unbound         128     0.4   0.5   0.1   0.4   0.5   0.1   0.4   0.5   0.1
   node0          24     0.3   0.5   0.2
   node0          32     0.3   0.4   0.1
   node0          40     0.7   0.8   0.1
   node0          48     0.8   0.9   0.1
   node0          56     0.8   0.9   0.1
   node0          64     0.2   0.4   0.1
   node1          24     0.3   0.5   0.2
   node1          32     0.3   0.5   0.2
   node1          40     0.8   0.9   0.1
   node1          48     0.8   0.9   0.1
   node1          56     0.9   0.9   0.1
   node1          64     0.2   0.4   0.1


There doesn't seem to be much of a pattern in the per-node breakdown.
Sometimes there's a bit more overhead on the node remote to the task_group
allocation than the node local to it, like I'd expect, and sometimes it's the
opposite.  Generally pretty even.
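
To make the remote-vs-local comparison concrete, here is a small sketch using the base-kernel update_cfs_group numbers from the unbound rows above, where the task_group landed on node 1 on both machines.  The helper name is made up, and the inputs are already rounded to one decimal in the tables, so the differences are coarse.

```python
def remote_minus_local(node0, node1, tg_nid):
    """Per-row profile% on the remote node minus the node holding the tg."""
    local, remote = (node0, node1) if tg_nid == 0 else (node1, node0)
    return [round(r - l, 1) for l, r in zip(local, remote)]

# update_cfs_group profile%, unbound rows, base kernel (tg nid = 1).
amd_node0 = [0.3, 0.3, 1.3, 1.1, 0.4, 0.1]
amd_node1 = [0.3, 0.3, 0.9, 0.7, 0.2, 0.1]
intel_node0 = [0.3, 0.5, 2.6, 3.5, 4.1, 0.4]
intel_node1 = [0.3, 0.5, 2.9, 3.9, 4.4, 0.4]

print(remote_minus_local(amd_node0, amd_node1, tg_nid=1))
print(remote_minus_local(intel_node0, intel_node1, tg_nid=1))
```

On the AMD box the remote node is at or above the local one in every row, while on the Intel box the sign flips in the middle rows, which is the mixed picture described above.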

> >    node0          48           0.7   0.6  -0.1              0.3   0.6   0.3
> >    node0          64           0.7   0.7  -0.1              0.3   0.6   0.3
> >    node0          80           1.4   1.3  -0.1              0.3   0.6   0.3
> >    node0          96           1.5   1.4  -0.1              0.3   0.6   0.3
> >    node0         112           0.8   0.8   0.0              0.2   0.4   0.2
> >    node0         128           0.4   0.4   0.0              0.1   0.2   0.1
> >    node1          48           0.7   0.6  -0.1              0.3   0.6   0.3
> >    node1          64           0.7   0.6  -0.1              0.3   0.6   0.3
> >    node1          80           1.4   1.2  -0.1              0.3   0.6   0.3
> >    node1          96           1.4   1.3  -0.2              0.3   0.6   0.3
> >    node1         112           0.8   0.7  -0.1              0.2   0.3   0.2
> >    node1         128           0.4   0.4   0.0              0.1   0.2   0.1
>
> I can see why the cost of update_cfs_group() slightly increased since
> now there is no cross node access to tg->load_avg and the patched kernel
> doesn't provide any benefit but only incur some overhead due to indirect
> access to tg->load_avg, but why update_load_avg()'s cost dropped? I
> expect it to be roughly the same after patched or slightly increased.

Yeah, that's not immediately obvious, especially when the Intel machine doesn't
do this.
