Message-ID: <87jzzz8tgm.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 02 Mar 2023 16:10:01 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Bharata B Rao <bharata@....com>
Cc: <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<mgorman@...e.de>, <peterz@...radead.org>, <mingo@...hat.com>,
<bp@...en8.de>, <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
<akpm@...ux-foundation.org>, <luto@...nel.org>,
<tglx@...utronix.de>, <yue.li@...verge.com>,
<Ravikumar.Bangoria@....com>
Subject: Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
Bharata B Rao <bharata@....com> writes:
> On 27-Feb-23 1:24 PM, Huang, Ying wrote:
>> Thank you very much for detailed data. Can you provide some analysis
>> for your data?
>
> The overhead numbers I shared earlier weren't correct as I
> realized that while obtaining those numbers from function_graph
> tracing, the trace buffer was silently getting overrun. I had to
> reduce the number of memory access iterations to ensure that I get
> the full trace buffer. I will be summarizing the findings
> based on these new numbers below.
>
> Just to recap - The microbenchmark is run on an AMD Genoa
> two node system. The benchmark has two sets of threads
> (one affined to each node) accessing two different chunks
> of memory (chunk size 8G) which are initially allocated
> on the first node. The benchmark touches each page in the
> chunk iteratively for a fixed number of iterations (384
> in the case given below). The benchmark score is the
> amount of time it takes to complete the specified number
> of accesses.
>
> Here is the data for the benchmark run:
>
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                         Default       IBS
> Fault handling          2875354862    2602455
> Task work handling      139023        24008121
> Sched switch handling                 37712
> Total overhead          2875493885    26648288
>
> Default
> -------
>                   Total        Min     Max       Avg
> do_numa_page      2875354862   0.08    392.13    22.11
> task_numa_work    139023       0.14    5365.77   532.66
> Total             2875493885
>
> IBS
> ---
>                         Total       Min     Max       Avg
> ibs_overflow_handler    2602455     0.14    103.91    1.29
> task_ibs_access_work    24008121    0.17    485.09    37.65
> hw_access_sched_in      37712       0.15    287.55    1.35
> Total                   26648288
>
>
>                                 Default        IBS
> Benchmark score(us)             160171762.0    40323293.0
> numa_pages_migrated             2097220        511791
> Overhead per page               1371           52
> Pages migrated per sec          13094          12692
> numa_hint_faults_local          2820311        140856
> numa_hint_faults                38589520       652647
For default, numa_hint_faults >> numa_pages_migrated. This is hard to
understand. I guess that there aren't many shared pages in the
benchmark? And I guess that there are enough free pages in the target
node too?
> hint_faults_local/hint_faults   7%             22%
>
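The derived rows in the table above can be cross-checked from the raw
counters. The following is only a sketch: the dictionary keys and the
derived() helper are names made up for this illustration, and the
formulas are inferred from the row labels rather than stated in the
original mail.

```python
# Raw numbers copied from the tables above; key names are illustrative only.
runs = {
    "Default": {"score_us": 160171762.0, "overhead_us": 2875493885,
                "migrated": 2097220, "faults": 38589520,
                "faults_local": 2820311},
    "IBS": {"score_us": 40323293.0, "overhead_us": 26648288,
            "migrated": 511791, "faults": 652647,
            "faults_local": 140856},
}

def derived(r):
    """Recompute the derived rows (assumed formulas, see lead-in)."""
    return {
        # total NUMA balancing overhead divided by pages migrated
        "overhead_per_page_us": r["overhead_us"] / r["migrated"],
        # pages migrated divided by benchmark run time in seconds
        "pages_per_sec": r["migrated"] / (r["score_us"] / 1e6),
        # share of hint faults that were local, in percent
        "local_fault_pct": 100.0 * r["faults_local"] / r["faults"],
        # hint faults taken per migrated page
        "faults_per_migration": r["faults"] / r["migrated"],
    }

# Relative reduction in run time of IBS versus default, in percent
speedup_pct = 100.0 * (1 - runs["IBS"]["score_us"] / runs["Default"]["score_us"])

for name, r in runs.items():
    print(name, {k: round(v, 2) for k, v in derived(r).items()})
print("IBS run time reduction (%):", round(speedup_pct, 1))
```

Under these assumed formulas the rounded values match the table, and
the default case takes roughly 18 hint faults per migrated page versus
a little over one for IBS.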
> Here is the summary:
>
> - In case of IBS, the benchmark completes 75% faster compared to
> the default case. The gain varies based on how many iterations of
> memory accesses we run as part of the benchmark. For 2048 iterations
> of accesses, I have seen a gain of around 50%.
> - The overhead of NUMA balancing (as measured by the time taken in
> the fault handling, task_work time handling and sched_switch time
> handling) in the default case is seen to be pretty high compared to
> the IBS case.
> - The number of hint-faults in the default case is significantly
> higher than the IBS case.
> - The local hint-faults percentage is much better in the IBS
> case compared to the default case.
> - As shown in the graphs (in other threads of this mail thread), in
> the default case, the page migrations start a bit slowly while IBS
> case shows steady migrations right from the start.
> - I have also shown (via graphs in other threads of this mail thread)
> that in IBS case the benchmark is able to steadily increase
> the access iterations over time, while in the default case, the
> benchmark doesn't make forward progress for a long time after
> an initial increase.
This is hard to understand too. Pages are migrated to the local node,
but performance doesn't improve.
> - Early migrations due to relevant access sampling from IBS
> are most probably the significant reason for the uplift that the
> IBS case gets.
In the original kernel, NUMA page table scanning is delayed for a
while. Please check the comment below in task_tick_numa().
/*
 * Using runtime rather than walltime has the dual advantage that
 * we (mostly) drive the selection from busy threads and that the
 * task needs to have done some actual work before we bother with
 * NUMA placement.
 */
I think this is generally reasonable, although it's not the best for
this micro-benchmark.
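To illustrate the effect of gating on runtime rather than walltime,
here is a toy model. It is not the kernel code: scan_times() and all of
its parameters are hypothetical, and it only assumes that a scan is
queued once the task's accumulated CPU runtime passes the previous
period boundary by one scan period.

```python
def scan_times(busy_fraction, scan_period_ms, wall_ms, tick_ms=1):
    """Return wall-clock times (ms) at which a scan would be queued.

    Toy model: the task accumulates CPU runtime at `busy_fraction` of
    wall-clock time, and a scan is queued whenever that runtime exceeds
    the last period boundary plus `scan_period_ms`.
    """
    runtime = 0.0   # accumulated CPU runtime of the task (ms)
    stamp = 0.0     # runtime value of the last period boundary (ms)
    fired = []
    for wall in range(tick_ms, wall_ms + 1, tick_ms):
        runtime += tick_ms * busy_fraction
        if runtime > stamp + scan_period_ms:
            stamp += scan_period_ms
            fired.append(wall)
    return fired

# A fully busy task versus one that runs a quarter of the time,
# both observed over the same 8 seconds of wall-clock time.
print(len(scan_times(1.0, 1000, 8000)))
print(scan_times(0.25, 1000, 8000))
```

In this model the mostly idle task has to wait about four times longer
in wall-clock terms before its first scan, which is the "needs to have
done some actual work" behaviour the comment describes.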
Best Regards,
Huang, Ying
> - It is consistently seen that the benchmark in the IBS case manages
> to complete the specified number of accesses even before the entire
> chunk of memory gets migrated. The early migrations are offsetting
> the cost of remote accesses too.
> - In the IBS case, we re-program the IBS counters for the incoming
> task in the sched_switch path. It is seen that this overhead isn't
> significant enough to slow down the benchmark.
> - One of the differences between the default case and the IBS case
> is about when the faults-since-last-scan is updated/folded into the
> historical faults stats and subsequent scan period update. Since we
> don't have the notion of scanning in IBS, I have a threshold (number
> of access faults) to determine when to update the historical faults
> and the IBS sample period. I need to check if quicker migrations
> could result from this change.
> - Finally, all this is for the above mentioned microbenchmark. The
> gains on other benchmarks are yet to be evaluated.
>
> Regards,
> Bharata.