linux-kernel - [RFC 0/2] How effective is numa_preferred

Message-Id: <20231216001801.3015832-1-chris.hyser@oracle.com>
Date: Fri, 15 Dec 2023 19:17:59 -0500
From: chris hyser <chris.hyser@...cle.com>
To: "Chris Hyser" <chris.hyser@...cle.com>,
        "Peter Zijlstra" <peterz@...radead.org>,
        "Mel Gorman" <mgorman@...e.de>, linux-kernel@...r.kernel.org
Cc: "Konrad Wilk" <konrad.wilk@...cle.com>
Subject: [RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA performance?

The commentary around the initial Oracle Soft Affinity proposal [1] had
recommended investigating the use of numa_preferred_nid as a better solution.
The primary driver for the original proposal (then as well as now) is better NUMA
performance for important tasks accessing RDMA-pinned memory. I wanted a
fairly simple test to explore the various aspects of NUMA performance and that
didn't require lots of time running TPC-C on a tuned DB as Subhra had done. I
needed something that would allow both task and memory placement, with usable
NUMA sensitivity, and I think I stumbled onto something quite useful. As the test
is only concerned with the NUMA effects of scheduler/balancer placement
decisions (no locks, no communication, no syscalls, etc. during the timed loop),
it does not represent any actual useful load, which makes it, I suppose, a NUMA
micro-benchmark.

A simplified description of the resulting benchmark is first a probe process
which times an outer loop doing a specified "counts" worth of a tight inner
loop. The inner loop in sequential mode would access every u64 in a large
buffer, but in this case it is an equivalent number of random (u64 aligned)
indexes into the memory buffer accessed by a 64-bit read then 64-bit write (the
code provides seq vs rand access as well as various access patterns, but this is
the combo most interesting for this). The probe's buffer memory is either
allowed to float or be bound to particular NUMA nodes while also allowing the
NUMA affinity of the process itself to be set (uses hard affinity) as well as
supporting use of the prctl() in patch 2 to set a "Preferred Node Affinity". The
main difference between this and probably dozens of similar programs is that the
probe isn't the benchmark; it's just an extremely NUMA-sensitive process. If you
run it by itself on an idle system, it will park on a CPU, fill up the associated
caches, and tell you absolutely nothing.
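
To make the access pattern concrete, here is a minimal sketch of the probe's
timed loop in Python. It is not the actual probe (which has not been posted
yet); the names run_probe, buf_words and counts are made up for illustration,
and interpreter overhead swamps any NUMA effect, so it only shows the shape of
the loop: one random u64-aligned index per buffer word, each accessed with a
64-bit read followed by a 64-bit write.

```python
import random
import time
from array import array

def run_probe(buf_words, counts, seed=0):
    """Time `counts` outer iterations of a random read-then-write inner loop."""
    buf = array("Q", bytes(8 * buf_words))    # zero-filled buffer of 64-bit words
    rng = random.Random(seed)
    times = []
    for _ in range(counts):
        start = time.perf_counter()
        for _ in range(buf_words):            # as many accesses as a sequential pass
            i = rng.randrange(buf_words)      # random u64-aligned index
            v = buf[i]                        # 64-bit read
            buf[i] = (v + 1) & 0xFFFFFFFFFFFFFFFF  # 64-bit write
        times.append(time.perf_counter() - start)
    return times, buf
```

Memory binding, hard affinity and the prctl() from patch 2 are deliberately
left out of the sketch; only the loop structure is shown.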

What ultimately makes this interesting is running it in the presence of load,
specifically a constant percentage of cpu-only load replicated and pinned on
each CPU. So, for example, HTOP would show all but one CPU at say 60% (what I
used in generating the results here, but the "effect" occurs even with just a 1%
load) with that lone CPU running the probe and pegged at 100%. The result is
that the load balancer really feels the need to balance, and the NUMA
awareness of those placement decisions is clearly discernible in the probe's
measured times. As well, the runtimes are sufficiently short to enable tracing
the entire life of the probe and categorizing all migrations as 'same core',
'same node', and 'cross node'. 
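
The categorization of migrations can be sketched as follows. This is only an
illustration, not the tooling used for the traces in this mail: it assumes
the migrations are already available as (source CPU, destination CPU) pairs,
for example extracted from sched_migrate_task trace events, and that the
CPU-to-core and CPU-to-node maps were built from sysfs-style CPU lists. All
function names here are hypothetical.

```python
from collections import Counter

def parse_cpulist(text):
    """Expand a sysfs-style CPU list such as '0-2,8' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def classify(src, dst, cpu_to_core, cpu_to_node):
    """Label one migration; the node is checked first, then the core."""
    if cpu_to_node[src] != cpu_to_node[dst]:
        return "crossnode"
    if cpu_to_core[src] == cpu_to_core[dst]:
        return "samecore"
    return "samenode"          # same node, different core

def count_migrations(migrations, cpu_to_core, cpu_to_node):
    """Count (src_cpu, dst_cpu) pairs per category, plus a total."""
    counts = Counter(classify(s, d, cpu_to_core, cpu_to_node)
                     for s, d in migrations)
    counts["total"] = sum(counts.values())
    return counts
```

With a toy 2-node, 8-CPU topology, a move between SMT siblings counts as
'samecore', a move to another core on the same node as 'samenode', and
anything that changes nodes as 'crossnode'.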

The above is a minimal description of the benchmark. I will make it available
if people are interested, once I get internal stuff sorted, so after the
holidays.

In terms of showing results, I also have test data for an AMD 8-node and an
ARM64 2-node box. I've also run tests exploring the benchmark over a range of
different migration_cost_ns values.  Again, if people are interested, I have
data to share. 

Test Results:
--------------
The below tests were run on an Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
box. This has two LLC-spanned memory nodes and 104 CPUs. The kernel was recent
tip:sched/core with the included patches (POC only) just to show the changes.

Key:
-----------------
NB   - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, ie 'Preferred Node Affinity'.
           (given 2 nodes:  0, 1, and -1 for not_set)
Mem  - represents the Memory node when memory is bound, else 'F' floating,
           ie not set
CPU  - represents the CPUs of the node that the probe is hard-affined to, else
           'F' floating, ie not set
Avg  - the average time of the probe's measurements in secs

Each line below represents the average of 64 test runs with the indicated
parameters.

NumSamples: 64 
Kernel: 6.7.0-rc1_ch_pna7_7+_#213 SMP PREEMPT_DYNAMIC Thu Dec  7 15:16:59 EST 2023
Load: 60
CPU_Model: IntelR XeonR Platinum 8167M CPU @ 2.00GHz
NUM_CPUS: 104
migration_cost_ns: 500000

       Avg       max     min     stdv  |       Test Parameters
----------------------------------------------------------------------
[00]  136.50   141.76   122.08   2.95  |  PNID: -1 NB: 0 Mem: 0 CPU: 0
[01]  168.78   172.07   156.04   2.58  |  PNID: -1 NB: 0 Mem: 0 CPU: 1
[02]  173.15   180.73   153.41   4.89  |  PNID: -1 NB: 0 Mem: 0 CPU: F
[03]  165.95   169.17   162.13   1.57  |  PNID: -1 NB: 0 Mem: 1 CPU: 0
[04]  137.23   144.28   123.75   4.97  |  PNID: -1 NB: 0 Mem: 1 CPU: 1
[05]  179.90   187.21   165.90   3.73  |  PNID: -1 NB: 0 Mem: 1 CPU: F
[06]  163.87   170.68   147.56   6.31  |  PNID: -1 NB: 0 Mem: F CPU: 0
[07]  168.96   174.40   156.51   3.74  |  PNID: -1 NB: 0 Mem: F CPU: 1
[08]  180.71   185.51   169.74   3.33  |  PNID: -1 NB: 0 Mem: F CPU: F
 
[09]  135.68   139.28   119.92   2.93  |  PNID: -1 NB: 1 Mem: 0 CPU: 0
[10]  166.60   169.82   160.05   1.76  |  PNID: -1 NB: 1 Mem: 0 CPU: 1
[11]  171.97   181.91   163.94   3.70  |  PNID: -1 NB: 1 Mem: 0 CPU: F
[12]  164.01   170.34   152.37   2.82  |  PNID: -1 NB: 1 Mem: 1 CPU: 0
[13]  138.01   142.27   135.20   1.22  |  PNID: -1 NB: 1 Mem: 1 CPU: 1
[14]  177.07   184.39   163.89   3.56  |  PNID: -1 NB: 1 Mem: 1 CPU: F
[15]  165.70   171.33   154.46   2.41  |  PNID: -1 NB: 1 Mem: F CPU: 0
[16]  165.18   170.83   149.12   5.99  |  PNID: -1 NB: 1 Mem: F CPU: 1
[17]  148.91   163.04   134.31   5.48  |  PNID: -1 NB: 1 Mem: F CPU: F

[18]  135.63   138.63   122.85   2.07  |  PNID:  0 NB: 1 Mem: 0 CPU: 0
[19]  162.38   170.60   146.03   6.73  |  PNID:  0 NB: 1 Mem: 0 CPU: 1
[20]  129.20   135.26   114.55   3.28  |  PNID:  0 NB: 1 Mem: 0 CPU: F
[21]  161.71   168.72   144.87   5.55  |  PNID:  0 NB: 1 Mem: 1 CPU: 0
[22]  135.72   140.44   123.34   3.10  |  PNID:  0 NB: 1 Mem: 1 CPU: 1
[23]  155.07   162.20   138.71   4.50  |  PNID:  0 NB: 1 Mem: 1 CPU: F
[24]  163.42   169.29   146.95   5.04  |  PNID:  0 NB: 1 Mem: F CPU: 0
[25]  165.90   170.44   157.56   1.67  |  PNID:  0 NB: 1 Mem: F CPU: 1
[26]  140.45   148.37   117.02   5.81  |  PNID:  0 NB: 1 Mem: F CPU: F

[27]  135.26   140.78   123.29   2.30  |  PNID:  1 NB: 1 Mem: 0 CPU: 0
[28]  166.22   169.51   148.18   2.65  |  PNID:  1 NB: 1 Mem: 0 CPU: 1
[29]  157.91   165.94   153.48   2.75  |  PNID:  1 NB: 1 Mem: 0 CPU: F
[30]  162.08   166.76   148.14   3.37  |  PNID:  1 NB: 1 Mem: 1 CPU: 0
[31]  136.86   140.03   127.42   2.01  |  PNID:  1 NB: 1 Mem: 1 CPU: 1
[32]  131.85   141.38   114.66   5.55  |  PNID:  1 NB: 1 Mem: 1 CPU: F
[33]  163.64   169.48   149.35   2.74  |  PNID:  1 NB: 1 Mem: F CPU: 0
[34]  165.94   170.47   156.10   2.41  |  PNID:  1 NB: 1 Mem: F CPU: 1
[35]  145.28   154.64   137.84   3.60  |  PNID:  1 NB: 1 Mem: F CPU: F

Observations:
---------------
First we see the expected result that memory and CPU bound/pinned on the same
node {0,4,9,13,18,22,27,31} is quite a bit faster than when bound/pinned on
different nodes {1,3,10,12,19,21,28,30}. Completely unexpected was that when
binding memory to a node but allowing the CPU to float (i.e., letting the
scheduler "schedule" and the load balancer "balance"), or letting both float,
the performance is as bad as or worse than pinning CPUs and memory on different
nodes {2,5,8,11,14}. NB does help when both memory and the CPU float.
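
To put rough numbers on these observations, the Avg column can be turned into
a slowdown relative to line [00] (memory and CPU both on node 0, NB off). The
values below are copied from the table; the helper name slowdown_pct is just
for illustration.

```python
# Avg times (secs) copied from the results table, keyed by line number.
avg = {0: 136.50, 1: 168.78, 2: 173.15, 8: 180.71, 17: 148.91}

def slowdown_pct(line, baseline=0):
    """Percent increase in average time of `line` over `baseline`."""
    return 100.0 * (avg[line] / avg[baseline] - 1.0)

for line in (1, 2, 8, 17):
    print(f"[{line:02d}] vs [00]: +{slowdown_pct(line):.1f}%")
```

This gives about 24% for pinning on different nodes [01], about 27% for bound
memory with a floating CPU [02], about 32% for both floating with NB off [08],
and about 9% for both floating with NB on [17].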

How is that possible? I did some traces of the probe with identical
params/kernel etc. These were then categorized as "same-core", "same-node (minus
same core)", and "cross-node".

Given this platform, a reasonable hypothesis is that cross-node migrations are
trashing the LLC and that is a big deal from a pure NUMA perspective. Is there a
general correlation between the number of cross-node migrations and the longer
completion times? The answer, I believe, is yes. (The samples below are
representative rather than averages, as there is still a manual step.)

When both memory and the CPUs are pinned (same node or different), we see
essentially no cross-node migrations (the occasional 1 is from when the probe
started on a different node than the one it was later hard-affined to):

		    CPU: 0, Mem: 0, NB=0, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 846       num_migrations_samecore : 887
    num_migrations_samenode : 2442      num_migrations_samenode : 2375
    num_migrations_crossnode: 1         num_migrations_crossnode: 1
    num_migrations: 3289                num_migrations: 3263

		    CPU: 1, Mem: 1, NB=0, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 822       num_migrations_samecore : 886
    num_migrations_samenode : 2156      num_migrations_samenode : 1982
    num_migrations_crossnode: 0         num_migrations_crossnode: 0
    num_migrations: 2978                num_migrations: 2868

		    CPU: 0, Mem: 1, NB=0, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 1038      num_migrations_samecore : 1055
    num_migrations_samenode : 2892      num_migrations_samenode : 2824
    num_migrations_crossnode: 0         num_migrations_crossnode: 1
    num_migrations: 3931                num_migrations: 3879


Compared to both CPU and memory allowed to float (as well as the impact of NB
and PNID):
		    CPU: F, Mem: F, NB=0, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 681       num_migrations_samecore : 800
    num_migrations_samenode : 2306      num_migrations_samenode : 2255
    num_migrations_crossnode: 1548      num_migrations_crossnode: 1503
    num_migrations: 4535                num_migrations: 4558

		    CPU: F, Mem: F, NB=1, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 799       num_migrations_samecore : 646
    num_migrations_samenode : 3098      num_migrations_samenode : 2775
    num_migrations_crossnode: 104       num_migrations_crossnode: 236
    num_migrations: 4001                num_migrations: 3657

		    CPU: F, Mem: F, NB=1, PNID=0
    -----------------------------------------------------------------
    num_migrations_samecore : 718       num_migrations_samecore : 737
    num_migrations_samenode : 3148      num_migrations_samenode : 3274
    num_migrations_crossnode: 2         num_migrations_crossnode: 7 
    num_migrations: 3868                num_migrations: 4018

We see that NB does have a big impact (a decrease in cross-node migrations),
confirmed by much better measured times: line {17} vs line {8}.

In terms of the primary use case, pinned RDMA mem buffers, the interesting
results are where the CPU is allowed to float with memory pinned
{2,5,11,14,20,23,29,32}. What do the migration counts look like for
those accesses?

		    CPU: F, Mem: 0, NB=0, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 788       num_migrations_samecore : 739
    num_migrations_samenode : 2251      num_migrations_samenode : 2292
    num_migrations_crossnode: 1738      num_migrations_crossnode: 1500
    num_migrations: 4777                num_migrations: 4531 

		    CPU: F, Mem: 0, NB=1, PNID=-1
    -----------------------------------------------------------------
    num_migrations_samecore : 663       num_migrations_samecore : 657
    num_migrations_samenode : 2434      num_migrations_samenode : 2427
    num_migrations_crossnode: 1344      num_migrations_crossnode: 1499
    num_migrations: 4441                num_migrations: 4583

		    CPU: F, Mem: 0, NB=1, PNID=0
    -----------------------------------------------------------------
    num_migrations_samecore : 653       num_migrations_samecore : 665
    num_migrations_samenode : 2954      num_migrations_samenode : 2880
    num_migrations_crossnode: 7         num_migrations_crossnode: 12
    num_migrations: 3614                num_migrations: 3557
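
As a quick way to compare these samples, the counts can be reduced to the
share of migrations that cross nodes and set next to the matching Avg time
from the results table. The counts below are the first-column samples from the
traces above; note that the traces were separate runs from the 64-run
averages, so the pairing is only indicative.

```python
# (same core, same node, cross node) counts from the first-column samples,
# paired with the Avg (secs) of the matching line in the results table.
samples = {
    "CPU:F Mem:F NB=0 PNID=-1": ((681, 2306, 1548), 180.71),
    "CPU:F Mem:F NB=1 PNID=-1": ((799, 3098, 104), 148.91),
    "CPU:F Mem:F NB=1 PNID=0":  ((718, 3148, 2), 140.45),
    "CPU:F Mem:0 NB=0 PNID=-1": ((788, 2251, 1738), 173.15),
    "CPU:F Mem:0 NB=1 PNID=-1": ((663, 2434, 1344), 171.97),
    "CPU:F Mem:0 NB=1 PNID=0":  ((653, 2954, 7), 129.20),
}

def crossnode_share(counts):
    """Fraction of all migrations that moved the task to another node."""
    return counts[2] / sum(counts)

for name, (counts, avg_secs) in samples.items():
    share = 100 * crossnode_share(counts)
    print(f"{name}: {share:5.1f}% cross-node, avg {avg_secs:.2f}s")
```

In these samples roughly a third of all migrations cross nodes in the NB=0
cases, while with PNID=0 the share drops to well under 1%.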

From a purely NUMA perspective, accurately setting the preferred node from user
space, "Preferred Node Affinity", appears to be a substantial win as can be seen
by comparing lines {2, 11} vs line {20} and lines {5, 14} vs line {32}. 

We also see that NB does not have nearly the same effect with the CPU
floating and the memory bound as when both were floating. The function
task_numa_work() does clearly skip non-migratable VMAs. The issue with this is
that when enabling NB, the most important accesses of some tasks aren't tracked,
while the accesses that are tracked can lead to the wrong value for
numa_preferred_nid, and thus NB gets turned off.
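
The effect can be illustrated with a deliberately simplified toy model. This
is not the kernel's accounting; the dict layout and the function name are
invented for the example, and the only assumption is that the preferred node
is the one with the most recorded faults. Skipping the non-migratable (bound)
VMA leaves only the minor accesses to vote.

```python
def preferred_nid(vmas, tracked_only=True):
    """Toy model: return the node with the most recorded faults, or -1."""
    totals = {}
    for vma in vmas:
        if tracked_only and not vma["migratable"]:
            continue                   # bound/pinned memory is never sampled
        for nid, nfaults in vma["faults"].items():
            totals[nid] = totals.get(nid, 0) + nfaults
    if not totals:
        return -1                      # nothing tracked, nothing preferred
    return max(totals, key=totals.get)

# Hypothetical task: a hot buffer bound to node 0 plus a little
# migratable memory that happens to be touched mostly on node 1.
vmas = [
    {"migratable": False, "faults": {0: 10000}},
    {"migratable": True,  "faults": {0: 10, 1: 40}},
]
print(preferred_nid(vmas))                      # hot buffer ignored
print(preferred_nid(vmas, tracked_only=False))  # hot buffer counted
```

In this made-up case the tracked accesses point at node 1 even though nearly
all of the task's accesses go to node 0, which is the situation where setting
the preferred node explicitly helps.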

On digging into this further, there was a 2014 presentation, "Automatic NUMA
Balancing" [2], which lists support for "unmovable" memory as future work,
recognizes its value in correctly setting numa_preferred_nid, but says it is
unclear if it is worthwhile. I am currently working on enabling this and running
such tests.

As a final note, I will have a chance to validate the effects of these changes
against the DB next month.


[1] [RFC PATCH 0/3] Scheduler Soft Affinity
https://lore.kernel.org/lkml/20190626224718.21973-1-subhra.mazumdar@oracle.com/

[2] "Automatic NUMA Balancing",
https://events.static.linuxfound.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf