Message-ID: <0384ed3b-498c-365a-6c12-3c297a5a8a0d@amd.com>
Date: Fri, 4 Feb 2022 16:33:57 +0530
From: Bharata B Rao <bharata@....com>
To: Mel Gorman <mgorman@...e.de>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, bristot@...hat.com,
dishaa.talreja@....com, Wei Huang <wei.huang2@....com>
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period
framework
On 2/1/2022 7:45 PM, Mel Gorman wrote:
> On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
>> On 1/31/2022 5:47 PM, Mel Gorman wrote:
>>> On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
>>>> From: Disha Talreja <dishaa.talreja@....com>
>>>>
>>>> Add a new framework that calculates autonuma scan period
>>>> based on per-process NUMA fault stats.
>>>>
>>>> NUMA faults can be classified into different categories, such
>>>> as local vs. remote, or private vs. shared. It is also important
>>>> to understand such behavior from the perspective of a process.
>>>> The per-process fault stats added here will be used for
>>>> calculating the scan period in the adaptive NUMA algorithm.
>>>>
>>>
>>> Be more specific on how the local vs remote, private vs shared stats
>>> are reflections of the per-task activity of the same.
>>
>> Sure, will document the algorithm better. However, the overall thinking
>> here is that address-space scanning is a per-process activity, and
>> hence a scan period value derived from the accumulated per-process
>> faults is more appropriate than calculating per-task (per-thread) scan
>> periods. Participating threads may each have their own local/remote and
>> private/shared behaviours, but when aggregated at the process level,
>> that gives a better input for eventual scan period variation. The
>> understanding is that individual thread fault rates will alter the
>> overall process metrics in such a manner that we respond by changing
>> the scan rate towards more or less aggressive scanning.
>>
>
> I don't have anything to add to your other responses as it would mostly
> be an acknowledgment of what you said.
>
> However, the major concern I have is that address-space wide decisions
> on scan rates have no sensible means of adapting to thread-specific
> requirements. I completely agree that it will result in more stable scan
> rates, particularly the adjustments. It also side-steps a problem where
> new threads may start with a scan rate that is completely inappropriate.
>
> However, I worry that it would be limited overall because each thread
> potentially has unique behaviour which is not obvious in a workload like
> NAS where threads are all executing similar instructions on different
> data. For other applications, threads may operate on thread-local areas
> only (low scan rate), others could operate on shared-only regions (high
> scan rate until back-off and interleave), threads can have phase behaviour
> (manager thread collecting data from worker threads) and threads can have
> different lifetimes and phase behaviour. Each thread would have a different
> optimal scan rate to decide if memory needs to be migrated to a local node
> or not. I don't see how address-space wide statistics could ever be mapped
> back to threads to adapt scan rates based on thread-specific behaviour.
So if all the threads have similar behaviour, wouldn't they all arrive at a
similar scan period independently, and shouldn't that stabilize the overall
scan period variation? But we do see variation in per-thread scan periods,
and an overall benefit in numbers from the per-mm scan period approach for
benchmarks like NAS.
And for a thread whose behaviour is completely different from the rest of
the group, there is currently no determinism AFAICS on when it would get
its chance to update the scan period, nor on whether the eventual scanning
happens in the areas of interest to that thread. In that case, isn't
changing the scan period in isolation to cater to that unique thread an
overhead on the process address space scanning?
Since process-level stats are essentially an aggregation of thread-level
stats, process-level stats will capture thread-level behaviour in general.
However, our current thinking is that if there are certain threads whose
behaviour is very different from the other threads, they should eventually
impact the process-level behaviour.
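
To make the aggregation concrete, here is a minimal sketch of what per-mm
fault accounting could look like. The struct and function names are
illustrative assumptions, not the actual fields from this patch series:

#include <linux/atomic.h>
#include <linux/topology.h>

/* Sketch only: hypothetical per-mm NUMA fault statistics. The real
 * patch may use different names and layout.
 */
struct mm_numa_stats {
	atomic_long_t faults_local;	/* fault was on the node we ran on */
	atomic_long_t faults_remote;	/* fault was on some other node */
	atomic_long_t faults_private;	/* last accessor was this task */
	atomic_long_t faults_shared;	/* last accessor was another task */
};

/* Called from the NUMA hinting fault path, alongside the existing
 * per-task accounting in task_numa_fault(). Every thread of the
 * process bumps the same counters, so the totals reflect the whole
 * address space rather than one thread's history.
 */
static void mm_numa_fault_account(struct mm_numa_stats *stats,
				  int mem_node, bool priv)
{
	if (mem_node == numa_node_id())
		atomic_long_inc(&stats->faults_local);
	else
		atomic_long_inc(&stats->faults_remote);

	if (priv)
		atomic_long_inc(&stats->faults_private);
	else
		atomic_long_inc(&stats->faults_shared);
}

The scan period would then be derived from the local/remote and
private/shared ratios of these aggregated counters rather than from any
single thread's fault history.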
For example, in a micro-benchmark where half the threads have local-only
accesses and the other half start with all-remote accesses, the initial
behaviour of the two sets is completely different, but we do see the
per-mm scan period approach performing more or less on par with the
existing approach. If we add a further optimization that tunes the scan
period in response to the detected node imbalance (this optimization
wasn't included in this initial series), things improve further.
Having said that, there could be corner cases where the per-mm approach
may not be able to capture the per-thread behaviour effectively, as you
note. We would certainly want to explore such cases with the per-mm
approach to understand the behaviour better. We can write micro-benchmarks
for this, but if there are already well-known benchmarks that exhibit such
behaviours at the per-thread level, we are willing to give them a try.
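
For reference, the kind of micro-benchmark described above is roughly of
the following shape (a simplified userspace sketch; a real test would pin
threads and memory to specific nodes with libnuma, which is omitted here):

/* Half the threads touch thread-local buffers (local-only accesses),
 * the other half hammer one shared buffer (initially remote for them).
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS	8
#define BUF_SIZE	(64UL << 20)	/* 64 MB per buffer */
#define ITERS		100

static char *shared_buf;

static void *local_worker(void *arg)
{
	char *buf = malloc(BUF_SIZE);	/* private, node-local faults */

	(void)arg;
	for (int i = 0; i < ITERS; i++)
		memset(buf, i, BUF_SIZE);
	free(buf);
	return NULL;
}

static void *shared_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		memset(shared_buf, i, BUF_SIZE);  /* shared, possibly remote */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	shared_buf = malloc(BUF_SIZE);

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL,
			       i < NTHREADS / 2 ? local_worker : shared_worker,
			       NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	free(shared_buf);
	return 0;
}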
>
> Thread scanning on the other hand can be improved in multiple ways. If
> nothing else, threads currently do redundant scanning of regions that
> are not relevant to the task, which gets increasingly problematic as VSZ
> increases. The obvious improvements are
>
> 1. Scan based on page table updates, not address ranges, to mitigate
> problems with THP vs base page updates
>
> 2. Move scan delay to be a per-vma structure that is kmalloced if
> necessary instead of being address space wide.
>
> 3. Track what threads access a VMA. The suggestion was to use an unsigned
> long pid_mask and use the lower bits to tag approximately what
> threads access a VMA. Skip VMAs that did not trap a fault. This would
> be approximate because of PID collisions but would reduce scanning
> of areas the thread is not interested in
>
> 4. Track active regions within VMAs. Very coarse tracking, use an unsigned
> long to track which ranges are active
>
> In different ways, this would reduce the amount of scanning work threads
> do and focus them on regions of relevance, reducing overall overhead
> without losing thread-specific details.
Thanks for these pointers, these are worth exploring. Any approach for
reducing the redundant scanning should complement the current effort to
optimize the scan period calculation. For instance, the per-VMA access
tagging in point 3 might look like the sketch below.
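
A minimal sketch of that idea, assuming a hypothetical pid_mask field on
the VMA (not something from these patches):

/* Sketch only: 'pid_mask' is a hypothetical unsigned long field on
 * struct vm_area_struct. Because multiple PIDs can map to the same
 * bit, this is an approximate filter, as noted above.
 */
static inline void vma_mark_accessed(struct vm_area_struct *vma)
{
	/* set from the NUMA hinting fault path */
	vma->pid_mask |= 1UL << (current->pid % BITS_PER_LONG);
}

static inline bool vma_scan_relevant(struct vm_area_struct *vma)
{
	/* checked during scanning: skip VMAs this thread never faulted in */
	return vma->pid_mask & (1UL << (current->pid % BITS_PER_LONG));
}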
Regards,
Bharata.