Message-ID: <e3da7191-d2c2-5c28-257e-7f52096c956e@oracle.com>
Date: Mon, 26 Feb 2024 19:47:55 -0500
From: Chris Hyser <chris.hyser@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>, Mel Gorman <mgorman@...e.de>,
        linux-kernel@...r.kernel.org
Cc: Konrad Wilk <konrad.wilk@...cle.com>, chris.hyser@...cle.com
Subject: Re: [RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA
 performance?

Included is additional micro-benchmark data from a 128-CPU AMD machine
(EPYC 7551 processor) concerning the effectiveness of setting a task's
numa_preferred_nid with respect to improving the NUMA awareness of the
scheduler. The test procedure is identical to the one described in the
original RFC. While this and the original RFC are answers to a specific
question asked by Peter, feedback on the experimental setup as well as
on the data would be appreciated.

The original RFC can be found at:
[https://lore.kernel.org/lkml/20231216001801.3015832-1-chris.hyser@oracle.com/]

Key:
-----------------
NB   - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, i.e. 'Preferred Node
       Affinity' (given 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7; -1 for not set)
Mem  - the node the memory is bound to, or 'F' (floating, i.e. not bound)
CPU  - the node whose CPUs the probe is hard-affined to, or 'F'
       (floating, i.e. not set)
Avg  - the average of the probe's measured times, in secs
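
For reference, below is a minimal userspace sketch of how each knob above
might be applied to the probe task. The PR_SET_PREFERRED_NID name and value
are placeholders for the prctl() interface proposed in the original RFC
(not an upstream constant), the node-0 CPU range 0-15 is an assumption, and
the rest uses standard interfaces (sched_setaffinity(), mbind()); NB itself
is toggled system-wide via the kernel.numa_balancing sysctl.

/*
 * Sketch of applying the knobs in the Key to the probe task; not the
 * actual benchmark source.  PR_SET_PREFERRED_NID is a placeholder for
 * the prctl() proposed in the RFC -- the real name/number may differ.
 * Build with -lnuma for mbind().
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <numaif.h>
#include <sys/prctl.h>

#define PR_SET_PREFERRED_NID  1000          /* placeholder value only */

static void pin_cpus_to_node0(void)         /* "CPU: 0" */
{
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        /* assumes node 0 = CPUs 0-15; really read node0/cpulist */
        for (cpu = 0; cpu < 16; cpu++)
                CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
}

static void *alloc_buf_on_node0(size_t sz)  /* "Mem: 0" */
{
        unsigned long nodemask = 1UL << 0;
        void *buf = aligned_alloc(4096, sz);

        mbind(buf, sz, MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
        return buf;
}

static void set_pnid(int nid)               /* "PNID: n", RFC prctl */
{
        prctl(PR_SET_PREFERRED_NID, nid, 0, 0, 0);
}

/* NB is set system-wide via: sysctl kernel.numa_balancing={0,1} */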

NumSamples: 36
Load: 60
CPU_Model: AMD EPYC 7551 32-Core Processor
NUM_CPUS: 128
Migration Cost: 500000

       Avg     max     min     stdv        Test Parameters
-----------------------------------------------------------------
[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[03] 301.27  311.84  280.22   8.98  |  PNID: -1 NB: 0 Mem: 1 CPU 0
[04] 213.60  221.36  190.10   6.53  |  PNID: -1 NB: 0 Mem: 1 CPU 1
[05] 396.37  418.58  376.10  10.15  |  PNID: -1 NB: 0 Mem: 1 CPU F
[06] 402.04  411.85  378.71   8.97  |  PNID: -1 NB: 0 Mem: F CPU 0
[07] 401.28  410.06  384.80   6.41  |  PNID: -1 NB: 0 Mem: F CPU 1
[08] 439.86  459.61  392.28  19.09  |  PNID: -1 NB: 0 Mem: F CPU F

[09] 214.81  225.35  199.34   5.38  |  PNID: -1 NB: 1 Mem: 0 CPU 0
[10] 299.15  314.84  274.00   8.18  |  PNID: -1 NB: 1 Mem: 0 CPU 1
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[12] 300.43  310.93  281.67   7.40  |  PNID: -1 NB: 1 Mem: 1 CPU 0
[13] 210.86  222.80  189.54   7.55  |  PNID: -1 NB: 1 Mem: 1 CPU 1
[14] 402.57  433.72  299.73  32.96  |  PNID: -1 NB: 1 Mem: 1 CPU F
[15] 390.04  410.10  370.63  10.72  |  PNID: -1 NB: 1 Mem: F CPU 0
[16] 393.32  418.43  370.52  10.71  |  PNID: -1 NB: 1 Mem: F CPU 1
[17] 370.07  424.58  255.16  43.26  |  PNID: -1 NB: 1 Mem: F CPU F

[18] 216.26  224.95  198.62   5.86  |  PNID:  0 NB: 1 Mem: 0 CPU 0
[19] 303.60  314.29  275.32   7.99  |  PNID:  0 NB: 1 Mem: 0 CPU 1
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F
[21] 301.17  315.03  283.77   8.07  |  PNID:  0 NB: 1 Mem: 1 CPU 0
[22] 209.34  218.63  187.69   9.11  |  PNID:  0 NB: 1 Mem: 1 CPU 1
[23] 342.34  369.42  311.99  12.79  |  PNID:  0 NB: 1 Mem: 1 CPU F
[24] 399.23  409.19  375.73   8.15  |  PNID:  0 NB: 1 Mem: F CPU 0
[25] 391.67  410.01  372.27  10.88  |  PNID:  0 NB: 1 Mem: F CPU 1
[26] 363.19  396.58  254.56  32.02  |  PNID:  0 NB: 1 Mem: F CPU F

[27] 215.29  224.59  193.76   8.16  |  PNID:  1 NB: 1 Mem: 0 CPU 0
[28] 300.19  312.95  280.26   9.32  |  PNID:  1 NB: 1 Mem: 0 CPU 1
[29] 340.97  362.79  323.94  10.69  |  PNID:  1 NB: 1 Mem: 0 CPU F
[30] 304.41  312.14  283.69   6.59  |  PNID:  1 NB: 1 Mem: 1 CPU 0
[31] 213.58  224.24  191.11   6.98  |  PNID:  1 NB: 1 Mem: 1 CPU 1
[32] 299.73  337.17  266.98  17.04  |  PNID:  1 NB: 1 Mem: 1 CPU F
[33] 395.56  411.33  359.70  12.24  |  PNID:  1 NB: 1 Mem: F CPU 0
[34] 398.52  409.42  377.33   7.28  |  PNID:  1 NB: 1 Mem: F CPU 1
[35] 355.64  377.61  279.13  26.71  |  PNID:  1 NB: 1 Mem: F CPU F

All data is present for completeness; however, the analysis can be limited
to comparing {00,01,02} (PNID=-1, NB=0), {09,10,11} (PNID=-1, NB=1) and
{18,19,20} (PNID=0, NB=1, Mem=0).

{00,09,18}, where memory and CPU are both pinned to the same node, are all
basically the same, as expected, since neither PNID nor NB should affect
scheduling in this case. We see the same pattern (near-equal values) when
memory and CPU are pinned to different nodes {01,10,19}. The interesting
analysis, in terms of the original problem (pinned RDMA buffers, tasks
floating), is how NB and PNID affect the case where memory is pinned and
the CPU is allowed to float. The base value {02} (PNID=-1, NB=0) is quite
a bit worse than when the CPU and memory are pinned to different nodes.
This is similar to the Intel case, where letting the load balancer place
the task is worse than pinning task and memory on different nodes. While
this may simply be an artifact of the micro-benchmark, given that the
benchmark is really just a sum of a large number of memory-access times
by the task, it is representative of the NUMA awareness of the
scheduler/load-balancer.
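
To make the "sum of access times" characterization concrete, the probe is
conceptually something like the loop below. This is only a sketch under
assumed parameters (buffer already bound to the target node, cache-line
stride, caller-chosen iteration count), not the actual benchmark:

#include <stdint.h>
#include <stddef.h>
#include <time.h>

/*
 * Sketch of a memory-latency probe: repeatedly walk a buffer that has
 * been bound to the target NUMA node and return the elapsed wall-clock
 * time in seconds.  Sizes, stride and iteration count are illustrative.
 */
static double probe(volatile uint64_t *buf, size_t words, int iters)
{
        struct timespec t0, t1;
        uint64_t sum = 0;
        size_t i;
        int it;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (it = 0; it < iters; it++)
                for (i = 0; i < words; i += 8)  /* one 64-byte line per access */
                        sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        (void)sum;      /* sum exists only to consume the loads */
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}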

We do see that enabling NB (with the default settings) provides some help
({11} 395.70 versus {02} 418.78) and that setting PNID to the node where
the memory lives provides a significant benefit ({20} 280.36 versus {11}
395.70 versus {02} 418.78). Unlike the prior Intel results, where PNID=0,
NB=1, Mem=0, CPU=F was generally faster than pinning CPU and memory to the
same node ({20} 129.20 versus {00} 136.5), on the AMD platform we don't
see nearly the same level of improvement ({20} 280.36 versus {00} 215.78).

This can be explained by the relatively small number of CPUs in a node
(16) and by the fact that each node contains two 8-CPU LLCs.

Analysis:

As mentioned in the RFC, the entire micro-benchmark can be traced and all
migrations of the benchmark task can be tabulated. Obviously, a same-core
migration is also a same-LLC migration, which is also a same-node
migration; each migration is counted only in its tightest category, so the
categories sum to num_migrations. Cross-node migrations are, however,
further broken down into 'to node 0' and 'from node 0'. Each table below
shows the tabulation for two separate runs, side by side.
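
As an illustration of how such a tabulation can bucket a single
orig_cpu -> dest_cpu migration (e.g. from a sched:sched_migrate_task trace
event), see the sketch below. It is not the actual post-processing used for
the numbers here, and the id scheme is an assumption (core and LLC ids are
taken to be globally unique):

/*
 * Sketch: classify one migration by comparing topology ids of the two
 * CPUs.  The ids would be filled in from sysfs, e.g.
 *   /sys/devices/system/cpu/cpuN/topology/core_id  (made globally unique)
 *   /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list (-> llc_id)
 *   /sys/devices/system/node/nodeM/cpulist          (-> node_id)
 */
enum mig_kind { MIG_SAMECORE, MIG_SAME_LLC, MIG_SAMENODE, MIG_CROSSNODE };

struct cpu_topo {
        int core_id;    /* SMT siblings share this        */
        int llc_id;     /* CPUs in one LLC/CCX share this */
        int node_id;    /* NUMA node                      */
};

static enum mig_kind classify(const struct cpu_topo *t, int orig, int dest)
{
        if (t[orig].core_id == t[dest].core_id)
                return MIG_SAMECORE;
        if (t[orig].llc_id == t[dest].llc_id)
                return MIG_SAME_LLC;
        if (t[orig].node_id == t[dest].node_id)
                return MIG_SAMENODE;
        /* cross-node; callers also check node_id == 0 for to/from node 0 */
        return MIG_CROSSNODE;
}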


     {00}            CPU: 0, Mem: 0, NB=0, PNID=-1
--------------------------------------------------------------------
     num_migrations_samecore : 1823       num_migrations_samecore : 1683
     num_migrations_same_llc : 3455       num_migrations_same_llc : 3277
     num_migrations_samenode : 914        num_migrations_samenode : 1016
     num_migrations_crossnode: 1          num_migrations_crossnode: 1
       num_migrations_to_0   : 1            num_migrations_to_0   : 1
       num_migrations_from_0 : 0            num_migrations_from_0 : 0
     num_migrations          : 6193       num_migrations          : 5977

     {01}            CPU: 1, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
     num_migrations_samecore : 2453       num_migrations_samecore : 2579
     num_migrations_same_llc : 4693       num_migrations_same_llc : 4735
     num_migrations_samenode : 1429       num_migrations_samenode : 1466
     num_migrations_crossnode: 1          num_migrations_crossnode: 1
       num_migrations_to_0   : 0            num_migrations_to_0   : 0
       num_migrations_from_0 : 1            num_migrations_from_0 : 1
     num_migrations          : 8576       num_migrations          : 8781

In the two cases where both the task's CPU and the memory buffer are
pinned, we see no cross-node migrations (ignoring the single migration
needed to get onto the correct node in the first place, which is due to
the benchmark starting the task on a different node). Why pinning CPU and
memory to different nodes results in more migrations overall needs more
investigation, as the effect seems fairly consistent.

     {02}            CPU: F, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
     num_migrations_samecore : 1620       num_migrations_samecore : 1744
     num_migrations_same_llc : 3142       num_migrations_same_llc : 2818
     num_migrations_samenode : 853        num_migrations_samenode : 625
     num_migrations_crossnode: 6344       num_migrations_crossnode: 6778
       num_migrations_to_0   : 769          num_migrations_to_0   : 776
       num_migrations_from_0 : 769          num_migrations_from_0 : 777
     num_migrations          : 11959      num_migrations          : 11965

     {11}            CPU: F, Mem: 0, NB=1, PNID=-1
---------------------------------------------------------------------
     num_migrations_samecore : 1966       num_migrations_samecore : 1963
     num_migrations_same_llc : 2803       num_migrations_same_llc : 3314
     num_migrations_samenode : 514        num_migrations_samenode : 721
     num_migrations_crossnode: 6833       num_migrations_crossnode: 6618
       num_migrations_to_0   : 818          num_migrations_to_0   : 630
       num_migrations_from_0 : 818          num_migrations_from_0 : 630
     num_migrations          : 12116      num_migrations          : 12616

From the data table, we see that {02} is slightly slower than {11} even
though {11} has more total migrations. Ultimately, what matters to the
total time is how much time the task spent running on node 0.
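
For completeness, the node-residency figure alluded to above could be
derived from the same trace roughly as sketched below, given the stream of
migration events (timestamp, destination CPU) and the cpu->node map. This
is an illustration only (it ignores time the task spends blocked), not the
tooling used here:

#define MAX_NODES 8

struct mig_event {
        double ts;       /* event timestamp, seconds */
        int    dest_cpu; /* CPU the task migrated to */
};

/*
 * Sketch: accumulate per-node residency between consecutive migration
 * events.  cpu_node[] maps CPU -> NUMA node; end_ts is the end of the
 * benchmark run.
 */
static void node_residency(const struct mig_event *ev, int nev,
                           const int *cpu_node, double end_ts,
                           double out[MAX_NODES])
{
        int i;

        for (i = 0; i < MAX_NODES; i++)
                out[i] = 0.0;

        for (i = 0; i < nev; i++) {
                double next = (i + 1 < nev) ? ev[i + 1].ts : end_ts;

                out[cpu_node[ev[i].dest_cpu]] += next - ev[i].ts;
        }
}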

     {20}            CPU: F, Mem: 0, NB=1, PNID=0
---------------------------------------------------------------------
     num_migrations_samecore : 1706       num_migrations_samecore : 1663
     num_migrations_same_llc : 2185       num_migrations_same_llc : 2816
     num_migrations_samenode : 591        num_migrations_samenode : 980
     num_migrations_crossnode: 4621       num_migrations_crossnode: 4243
       num_migrations_to_0   : 480          num_migrations_to_0   : 419
       num_migrations_from_0 : 480          num_migrations_from_0 : 418
     num_migrations          : 9103       num_migrations          : 9702

The trace results here are more representative of the observed performance
improvement: the number of cross-node migrations is significantly lower,
and in particular there are far fewer migrations away from node 0.

In summary, the data (relevant rows copied below) shows that setting a
task's numa_preferred_nid results in a sizable improvement in completion
times.

[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F

