Message-ID: <72b6ec8b-f141-3807-d7f2-f853b0f0b76c@amd.com>
Date: Mon, 13 Feb 2023 11:22:12 +0530
From: Bharata B Rao <bharata@....com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, mgorman@...e.de,
peterz@...radead.org, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org,
akpm@...ux-foundation.org, luto@...nel.org, tglx@...utronix.de,
yue.li@...verge.com, Ravikumar.Bangoria@....com
Subject: Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
On 2/13/2023 8:56 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@....com> writes:
>
>> Hi,
>>
>> Some hardware platforms can provide information about memory accesses
>> that can be used to do optimal page and task placement on NUMA
>> systems. AMD processors have a hardware facility called Instruction-
>> Based Sampling (IBS) that can be used to gather specific metrics
>> related to instruction fetch and execution activity. This facility
>> can be used to perform memory access profiling based on statistical
>> sampling.
>>
>> This RFC is a proof-of-concept implementation where the access
>> information obtained from the hardware is used to drive NUMA balancing.
>> With this it is no longer necessary to scan the address space and
>> introduce NUMA hint faults to build task-to-page association. Hence
>> the approach taken here is to replace the address space scanning plus
>> hint faults with the access information provided by the hardware.
>
> Your method can avoid the address space scanning, but cannot avoid memory
> access faults in fact. The PMU will raise an NMI and then a task_work to
> process the sampled memory accesses. The overhead depends on the frequency
> of the memory access sampling. Please measure the overhead of your method
> in detail.
Yes, the address space scanning is avoided. I will measure the overhead
of the hint fault vs. NMI handling paths. The actual processing of the
accesses from task_work context is pretty similar to the stats processing
done for hint faults. As you note, the overhead depends on the frequency
of sampling. In the current approach, the sampling period is per-task
and it varies based on the same logic that NUMA balancing uses to
vary the scan period.
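
Roughly, the adjustment would be along the lines of the sketch below
(the names, bounds and the shared-vs-private heuristic here are just
illustrative placeholders, not the actual patch code):

/*
 * Illustrative sketch only: a per-task IBS sample period that grows or
 * shrinks within fixed bounds, mirroring the way NUMA balancing adjusts
 * the scan period. All names and constants are placeholders.
 */
#define SAMPLE_PERIOD_MIN	10000		/* assumed lower bound */
#define SAMPLE_PERIOD_MAX	10000000	/* assumed upper bound */

static unsigned int adjust_sample_period(unsigned int period,
					 unsigned long shared,
					 unsigned long private)
{
	if (private > shared)
		period /= 2;	/* mostly private accesses: sample more often */
	else
		period *= 2;	/* mostly shared accesses: back off */

	if (period < SAMPLE_PERIOD_MIN)
		period = SAMPLE_PERIOD_MIN;
	if (period > SAMPLE_PERIOD_MAX)
		period = SAMPLE_PERIOD_MAX;

	return period;
}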
>
>> The access samples obtained from hardware are fed to NUMA balancing
>> as fault-equivalents. The rest of the NUMA balancing logic that
>> collects/aggregates the shared/private/local/remote faults and does
>> pages/task migrations based on the faults is retained except that
>> accesses replace faults.
>>
>> This early implementation is an attempt to get a working solution
>> only and as such a lot of TODOs exist:
>>
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>> There needs to be a proper way to make the use mutually exclusive.
>> - Is tying this up with NUMA balancing a reasonable approach or
>> should we look at a completely new approach?
>> - When accesses replace faults in NUMA balancing, a few things have
>> to be tuned differently. All such decision points need to be
>> identified and appropriate tuning needs to be done.
>> - Hardware provided access information could be very useful for driving
>> hot page promotion in tiered memory systems. Need to check if this
>> requires different tuning/heuristics apart from what NUMA balancing
>> already does.
>> - Some of the values used to program the IBS counters, like the sampling
>>   period etc., may not be optimal or ideal. The sample period
>> adjustment follows the same logic as scan period modification which
>> may not be ideal. More experimentation is required to fine-tune all
>> these aspects.
>> - Currently I am acting (i.e., attempting to migrate a page) on each sampled
>> access. Need to check if it makes sense to delay it and do batched page
>> migration.
>
> Your current implementation is tied to AMD IBS. You will need an
> architecture/vendor-independent framework for upstreaming.
I have tried to keep it vendor- and arch-neutral as far as possible,
and will of course revisit this to make the interfaces more robust
and useful.
I have defined a static key (hw_access_hints=false) which will be
set only by the platform driver when it detects the hardware
capability to provide memory access information. NUMA balancing
code skips the address space scanning when it sees this capability.
The platform driver (access fault handler) will call into the NUMA
balancing API with the linear and physical address of the sampled
access. Hence any equivalent hardware functionality could
plug into this scheme in its current form. There are checks for this
static key in the NUMA balancing logic at a few points to decide if
it should work based on access faults or hint faults.
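
To illustrate the shape of this interface (the function name and
signature below are only indicative, not the exact code in the
patches):

#include <linux/jump_label.h>

/* Set by the platform driver when it detects hardware access profiling. */
DEFINE_STATIC_KEY_FALSE(hw_access_hints);

/*
 * Indicative entry point: the platform driver's sample handler calls
 * this from task_work context with the linear and physical address of
 * the sampled access, which is then accounted like a hint fault.
 */
void hw_access_numa_hint(unsigned long vaddr, unsigned long paddr)
{
	/* ... feed the sample into the existing NUMA fault statistics ... */
}

/* In the scanning path (e.g. task_numa_work()): */
static void task_numa_work_sketch(void)
{
	/* Skip address space scanning when hardware access hints exist. */
	if (static_branch_unlikely(&hw_access_hints))
		return;
	/* ... existing address space scanning and hint-fault setup ... */
}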
>
> BTW: can IBS sample memory writes too? Or just memory reads?
IBS can tag both store and load operations.
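
Whether a tagged op was a load and/or a store can be read from the op
data the hardware returns; a minimal check could look like the snippet
below (bit positions per my reading of the IBS definition of
IBS_OP_DATA3, macro and function names are illustrative):

#include <linux/types.h>

/* IBS_OP_DATA3: bit 0 = IbsLdOp (load), bit 1 = IbsStOp (store). */
#define IBS_LD_OP	(1ULL << 0)
#define IBS_ST_OP	(1ULL << 1)

static inline bool ibs_op_is_mem_access(u64 op_data3)
{
	return op_data3 & (IBS_LD_OP | IBS_ST_OP);
}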
>
>> This RFC is mainly about showing how hardware provided access
>> information could be used for NUMA balancing but I have run a
>> few basic benchmarks from mmtests to check if this causes any severe
>> regression/overhead to any of those. Some benchmarks show some
>> improvement, some show no significant change and a few regress.
>> I am hopeful that with more appropriate tuning there is scope for
>> further improvement here, especially for workloads for which NUMA
>> matters.
>
> What's your expected improvement from the PMU-based NUMA balancing? Should
> it come from reduced overhead? Higher accuracy? Quicker response?
> I think that it may be better to prove that with appropriate statistics
> for at least one workload.
Just to clarify, unlike PEBS, IBS works independently of the PMU.
I believe the improvement will come from the reduced overhead of
sampling only the relevant accesses.
I have a microbenchmark where two sets of threads, bound to two
NUMA nodes, access two different halves of a memory region that is
initially allocated on the first node.
On a two-node Zen4 system, with 64 threads in each set accessing
8G of memory each from the initial allocation of 16G, I see that
IBS-driven NUMA balancing (i.e., this patchset) takes 50% less time
to complete a fixed number of memory accesses. This could well
be the best case, and real workloads/benchmarks may not get this much
uplift, but it does show the potential gain to be had.
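
For reference, the benchmark is roughly along these lines (a simplified
user-space sketch with libnuma; thread counts, sizes and the access
loop are placeholders rather than the exact code I used):

/* Sketch only: build with gcc -O2 bench.c -o bench -lnuma -lpthread */
#include <numa.h>
#include <pthread.h>
#include <string.h>

#define THREADS_PER_SET	64
#define SET_SIZE	(8UL << 30)	/* 8G accessed by each thread set */
#define ITERATIONS	100

struct set_arg {
	char *base;	/* half of the buffer this set accesses */
	int node;	/* node the set's threads are bound to */
};

static void *worker(void *p)
{
	struct set_arg *a = p;
	size_t off;
	int i;

	numa_run_on_node(a->node);	/* bind thread to its node */
	for (i = 0; i < ITERATIONS; i++)
		for (off = 0; off < SET_SIZE; off += 4096)
			a->base[off]++;	/* touch one byte per page */
	return NULL;
}

int main(void)
{
	pthread_t tid[2 * THREADS_PER_SET];
	struct set_arg args[2];
	char *mem;
	int i;

	if (numa_available() < 0)
		return 1;

	/* The entire 16G is initially allocated and faulted in on node 0. */
	mem = numa_alloc_onnode(2 * SET_SIZE, 0);
	if (!mem)
		return 1;
	memset(mem, 0, 2 * SET_SIZE);

	args[0] = (struct set_arg){ mem, 0 };
	args[1] = (struct set_arg){ mem + SET_SIZE, 1 };

	for (i = 0; i < 2 * THREADS_PER_SET; i++)
		pthread_create(&tid[i], NULL, worker,
			       &args[i / THREADS_PER_SET]);
	for (i = 0; i < 2 * THREADS_PER_SET; i++)
		pthread_join(tid[i], NULL);

	numa_free(mem, 2 * SET_SIZE);
	return 0;
}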
Thanks for your inputs.
Regards,
Bharata.