Date:   Wed, 1 Mar 2023 16:51:25 +0530
From:   Bharata B Rao <bharata@....com>
To:     "Huang, Ying" <ying.huang@...el.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org, mgorman@...e.de,
        peterz@...radead.org, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, tglx@...utronix.de,
        yue.li@...verge.com, Ravikumar.Bangoria@....com
Subject: Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

On 27-Feb-23 1:24 PM, Huang, Ying wrote:
> Thank you very much for detailed data.  Can you provide some analysis
> for your data?

The overhead numbers I shared earlier weren't correct: while
obtaining them from function_graph tracing, the trace buffer
was silently being overrun. I had to reduce the number of memory
access iterations to ensure that the full trace fit in the
buffer. The findings below are based on these new numbers.

Just to recap - the microbenchmark is run on a two-node AMD
Genoa system. The benchmark has two sets of threads
(one set affined to each node) accessing two different chunks
of memory (chunk size 8G), both initially allocated
on the first node. Each thread touches every page in its
chunk iteratively for a fixed number of iterations (384
for the run shown below). The benchmark score is the
amount of time taken to complete the specified number
of accesses.
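The access pattern above can be sketched as follows. This is a
hypothetical, scaled-down model for illustration only (4M chunks and 8
iterations instead of 8G and 384, and no NUMA affinity, which plain
Python cannot express); the names run_benchmark/touch_pages are my own,
not from the actual benchmark.

```python
import threading, time

PAGE_SIZE = 4096
CHUNK_SIZE = 4 << 20          # scaled down from the real 8G chunk
ITERATIONS = 8                # scaled down from the real 384

def touch_pages(chunk, result, idx):
    """One access per page, repeated ITERATIONS times over the chunk."""
    touched = 0
    for _ in range(ITERATIONS):
        for off in range(0, len(chunk), PAGE_SIZE):
            chunk[off] = (chunk[off] + 1) & 0xFF
            touched += 1
    result[idx] = touched

def run_benchmark(nthreads=2):
    # In the real benchmark both chunks are first-touched on node 0 and
    # each thread set is affined to a different node; here the chunks
    # are just ordinary process memory.
    chunks = [bytearray(CHUNK_SIZE) for _ in range(nthreads)]
    result = [0] * nthreads
    start = time.monotonic()
    threads = [threading.Thread(target=touch_pages, args=(c, result, i))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    score_us = (time.monotonic() - start) * 1e6  # the "benchmark score"
    return result, score_us
```

The score is simply wall-clock time for the fixed access count, so any
speedup from earlier page migration shows up directly in it.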

Here is the data for the benchmark run:

Time taken or overhead (us) for fault, task_work and sched_switch
handling

				Default		IBS
Fault handling			2875354862	2602455
Task work handling		139023		24008121
Sched switch handling		-		37712
Total overhead			2875493885	26648288

Default
-------
			Total		Min	Max		Avg
do_numa_page		2875354862	0.08	392.13		22.11
task_numa_work		139023		0.14	5365.77		532.66
Total			2875493885

IBS
---
			Total		Min	Max		Avg
ibs_overflow_handler	2602455		0.14	103.91		1.29
task_ibs_access_work	24008121	0.17	485.09		37.65
hw_access_sched_in	37712		0.15	287.55		1.35
Total			26648288


				Default		IBS
Benchmark score(us)		160171762.0	40323293.0
numa_pages_migrated		2097220		511791
Overhead per page		1371		52
Pages migrated per sec		13094		12692
numa_hint_faults_local		2820311		140856
numa_hint_faults		38589520	652647
hint_faults_local/hint_faults	7%		22%
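The derived rows in the table above follow from the raw counters; a
quick sanity check (the dict keys are my own labels, not kernel
counter names):

```python
# Raw counters from the two runs above.
runs = {
    "Default": {"score_us": 160171762.0, "overhead_us": 2875493885,
                "migrated": 2097220, "faults_local": 2820311,
                "faults": 38589520},
    "IBS":     {"score_us": 40323293.0, "overhead_us": 26648288,
                "migrated": 511791, "faults_local": 140856,
                "faults": 652647},
}

for name, r in runs.items():
    overhead_per_page = r["overhead_us"] / r["migrated"]
    pages_per_sec = r["migrated"] / (r["score_us"] / 1e6)
    local_pct = 100 * r["faults_local"] / r["faults"]
    print(f"{name}: {overhead_per_page:.0f} us/page, "
          f"{pages_per_sec:.0f} pages/s, {local_pct:.0f}% local")

# Time reduction of the IBS run relative to Default (~75%).
reduction = 1 - runs["IBS"]["score_us"] / runs["Default"]["score_us"]
print(f"IBS time reduction: {100 * reduction:.0f}%")
```

This reproduces 1371 vs 52 us of overhead per migrated page, 13094 vs
12692 pages/s, the 7% vs 22% local-fault ratios, and the ~75% speedup
discussed below.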

Here is the summary:

- In case of IBS, the benchmark completes 75% faster compared to
  the default case. The gain varies based on how many iterations of
  memory accesses we run as part of the benchmark. For 2048 iterations
  of accesses, I have seen a gain of around 50%.
- The overhead of NUMA balancing (as measured by the time taken in
  fault handling, task_work handling and sched_switch handling)
  in the default case is seen to be pretty high compared to the
  IBS case.
- The number of hint-faults in the default case is significantly
  higher than the IBS case.
- The local hint-faults percentage is much better in the IBS
  case compared to the default case.
- As shown in the graphs (in other threads of this mail thread), in
  the default case, the page migrations start a bit slowly while IBS
  case shows steady migrations right from the start.
- I have also shown (via graphs in other threads of this mail thread)
  that in the IBS case the benchmark is able to steadily increase
  the access iterations over time, while in the default case, the
  benchmark doesn't make forward progress for a long time after
  an initial increase.
- Early migrations, driven by IBS sampling the relevant accesses,
  are most probably the main reason for the uplift that the IBS
  case gets.
- It is consistently seen that the benchmark in the IBS case manages
  to complete the specified number of accesses even before the entire
  chunk of memory gets migrated. The early migrations are offsetting
  the cost of remote accesses too.
- In the IBS case, we re-program the IBS counters for the incoming
  task in the sched_switch path. This overhead is seen to be not
  significant enough to slow down the benchmark.
- One of the differences between the default case and the IBS case
  is when the faults-since-last-scan count is folded into the
  historical fault stats and the scan period is subsequently
  updated. Since we don't have the notion of scanning in IBS, I use
  a threshold (number of access faults) to determine when to update
  the historical faults and the IBS sample period. I need to check
  whether quicker migrations could result from this change.
- Finally, all this is for the above mentioned microbenchmark. The
  gains on other benchmarks are yet to be evaluated.

Regards,
Bharata.
