Message-ID: <56872982-8676-4d65-85ef-b894728db18b@amd.com>
Date: Fri, 14 Mar 2025 07:26:55 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-mm@...ck.org,
akpm@...ux-foundation.org, lsf-pc@...ts.linux-foundation.org,
bharata@....com, gourry@...rry.net, nehagholkar@...a.com,
abhishekd@...a.com, nphamcs@...il.com, hannes@...xchg.org,
feng.tang@...el.com, kbusch@...a.com, Hasan.Maruf@....com, sj@...nel.org,
david@...hat.com, willy@...radead.org, k.shutemov@...il.com,
mgorman@...hsingularity.net, vbabka@...e.cz, hughd@...gle.com,
rientjes@...gle.com, shy828301@...il.com, liam.howlett@...cle.com,
peterz@...radead.org, mingo@...hat.com, nadav.amit@...il.com,
shivankg@....com, ziy@...dia.com, jhubbard@...dia.com,
AneeshKumar.KizhakeVeetil@....com, linux-kernel@...r.kernel.org,
jon.grimm@....com, santosh.shukla@....com, Michael.Day@....com,
riel@...riel.com, weixugc@...gle.com, leesuyeon0506@...il.com,
honggyu.kim@...com, leillc@...gle.com, kmanaouil.dev@...il.com,
rppt@...nel.org, dave.hansen@...el.com, dongjoo.linux.dev@...il.com
Subject: Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion
based on PTE A bit scanning
On 2/8/2025 12:36 AM, Davidlohr Bueso wrote:
> On Sun, 26 Jan 2025, Huang, Ying wrote:
>
>> Hi, Raghavendra,
>>
>> Raghavendra K T <raghavendra.kt@....com> writes:
>>
>>> Bharata and I would like to propose the following topic for LSFMM.
>>>
>>> Topic: Overhauling hot page detection and promotion based on PTE A
>>> bit scanning.
>>>
>>> In the Linux kernel, hot page information can potentially be obtained
>>> from
>>> multiple sources:
>>>
>>> a. PROT_NONE faults (NUMA balancing)
>>> b. PTE Access bit (LRU scanning)
>>> c. Hardware provided page hotness info (like AMD IBS)
>>>
>>> This information is further used to migrate (or promote) pages from
>>> slow memory
>>> tier to top tier to increase performance.
>>>
>>> In the current hot page promotion mechanism, all the activities
>>> including the
>>> process address space scanning, NUMA hint fault handling and page
>>> migration are
>>> performed in the process context. i.e., scanning overhead is borne by
>>> the
>>> applications.
>>>
>>> I had recently posted a patch [1] to improve this in the context of
>>> slow-tier
>>> page promotion. Here, Scanning is done by a global kernel thread
>>> which routinely
>>> scans all the processes' address spaces and checks for accesses by
>>> reading the
>>> PTE A bit. The hot pages thus identified are maintained in list and
>>> subsequently
>>> are promoted to a default top-tier node. Thus, the approach pushes
>>> overhead of
>>> scanning, NUMA hint faults and migrations off from process context.
>
> It seems that overall having a global view of hot memory is where folks
> are leaning
> towards. In the past we have discussed an external thread to harvest
> information
> from different sources and do the corresponding migration. I think your
> work is a
> step in this direction (and shows promising numbers), but I'm not sure
> if it should
> be doing the scanning part, as opposed to just receive the information
> and migrate
> (according to some policy based on a wider system view of what is hot;
> ie: what CHMU
> says is hot might not be so hot to the rest of the system, or as is
> pointed out
> below, workload based, as priorities).
>
>>
>> This has been discussed before too. For example, in the following thread
>>
>> https://lore.kernel.org/all/20200417100633.GU20730@...ez.programming.kicks-ass.net/T/
>>
>> The drawbacks of asynchronous scanning including
>>
>> - The CPU cycles used are not charged properly
>>
>> - There may be no idle CPU cycles to use
>>
>> - The scanning CPU may be not near the workload CPUs enough
>
> One approach we experimented with was doing only the page migration
> asynchronously,
> leaving the scanning to the task context, which also knows the dest numa
> node.
> Results showed that page fault latencies were reduced without affecting
> benchmark
> performance. Of course busy systems are an issue, as the window between
> servicing
> the fault and actually making it available to the user in fast memory is
> enlarged.
>
>> It's better to involve Mel and Peter in the discussion for this.
>>
>>> The topic was presented in the MM alignment session hosted by David
>>> Rientjes [2].
>>> The topic also finds a mention in S J Park's LSFMM proposal [3].
>>>
>>> Here is the list of potential discussion points:
>>> 1. Other improvements and enhancements to PTE A bit scanning
>>> approach. Use of
>>> multiple kernel threads, throttling improvements, promotion policies,
>>> per-process
>>> opt-in via prctl, virtual vs physical address based scanning, tuning
>>> hot page
>>> detection algorithm etc.
>>
>> One drawback of physical address based scanning is that it's hard to
>> apply some workload specific policy. For example, if a low priority
>> workload has many relatively hot pages, while a high priority workload
>> has many relative warm (not so hot) pages. We need to promote the warm
>> pages in the high priority workload, while physical address based
>> scanning may report the hot pages in the low priority workload. Right?
>>
>>> 2. Possibility of maintaining single source of truth for page hotness
>>> that would
>>> maintain hot page information from multiple sources and let other
>>> sub-systems
>>> use that info.
>>>
>>> 3. Discuss how hardware provided hotness info (like AMD IBS) can
>>> further aid
>>> promotion. Bharata had posted an RFC [4] on this a while back.
>>>
>>> 4. Overlap with DAMON and potential reuse.
>>>
>>> Links:
>>>
>>> [1] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@....com/
>>> [2] https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@...ac.uk/T/
>>> [3] https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
>>> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/
Hello All,
Sorry to come back late on this. After the "Unifying source of page
temperature" discussion, I (along with Bharata) was trying to get one
step closer to that.
(Some time was also spent on a failed attempt at multi-threaded scanning,
which perhaps needs more time, if it is needed at all.)
I am posting a single patch, still in a "raw" state, as a reply to
this email. I will clean up and split the patch and post it early next week.
I am sending this to give a gist of what is coming, at least before LSFMM.
So here is the list of implemented feedback that we can build on further
(depending on the consensus).
1. Scanning and migration are separated. A separate migration thread is
created.
Potential improvements that can be done here:
- Have one instance of the migration thread per node.
- An API to accept hot pages for promotion from different sources
(e.g., IBS / LRU, as Bharata already mentioned)
- Control throttling, similar to what Huang has done in the NUMAB=2 case
- Take both PFN and folio as arguments for migration
- Make use of batch migration enhancements
- Use a per-mm migration list for easy lookup and control
(using mmslot; this also helps in identifying actually hot pages,
i.e. two subsequent accesses rather than a single access).
2. Implemented David's (Rientjes) suggestion of having a prctl approach.
Currently prctl values can range from 0..10.
0 is for disabling
>1 for enabling. In the future, the idea is to use this value to control
the scan rate further.
3. Steve's comment on tracing is incorporated.
4. The issue Davidlohr reported on the patch series is fixed.
5. Very importantly,
I do have a basic algorithm that detects the "target node for migration",
which was the main pain point for PTE A bit scanning.
Algorithm:
As part of our scanning, we also scan toptier pages.
During the scan, the number of pages scanned/accessed that belong to
each particular toptier/slowtier node is also recorded.
Currently my algorithm chooses the toptier node that had the maximum
pages scanned.
But we can build a more complex algorithm using the recent
scanned/accessed info (e.g., decay the last scanned/accessed info; if the
current toptier node becomes nearly full, find the next preferred node,
thus using a nodemask or preferred list instead of a single node, etc.).
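
To make the target-node selection in item 5 concrete, here is a rough userspace sketch in Python. It is not the kernel code from the patch; the names `NodeScanStats`, `record`, `decay` and `pick_target_node` are made up for illustration, and the halving-based `decay` step is only one assumed way of ageing the counters, as hinted at in the parenthetical above.

```python
from collections import defaultdict


class NodeScanStats:
    """Per-node counters gathered while scanning (hypothetical structure)."""

    def __init__(self):
        self.scanned = defaultdict(int)   # node id -> pages scanned
        self.accessed = defaultdict(int)  # node id -> pages found with A bit set

    def record(self, node_id, was_accessed):
        """Account one scanned page that resides on node_id."""
        self.scanned[node_id] += 1
        if was_accessed:
            self.accessed[node_id] += 1

    def decay(self, shift=1):
        """Assumed ageing step: right-shift every counter (halve by default)."""
        for counters in (self.scanned, self.accessed):
            for node_id in counters:
                counters[node_id] >>= shift


def pick_target_node(stats, toptier_nodes, default_node=None):
    """Return the toptier node with the most pages scanned.

    Slow-tier nodes are ignored as candidates. If no toptier node has any
    scanned pages, default_node is returned. Ties go to the lowest node id.
    """
    best_node, best_count = default_node, 0
    for node_id in sorted(toptier_nodes):
        count = stats.scanned.get(node_id, 0)
        if count > best_count:
            best_node, best_count = node_id, count
    return best_node
```

A caller would feed `record()` from each scan pass, call `decay()` between passes if ageing is wanted, and ask `pick_target_node()` for the promotion target. Handling a nearly full node would need an extra capacity check and a preferred list, which this sketch leaves out.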
Potential improvements on the scanning part could be the use of more
complex data structures to maintain areas of hot pages, similar to what
DAMON is doing, or reuse of some infrastructure from DAMON.
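
Going back to item 1, the per-mm migration list and the "two subsequent accesses" idea could be sketched roughly as below. Again this is only a userspace Python illustration under my own assumptions: `HotPageTracker`, `report` and `drain` are hypothetical names, the threshold of 2 mirrors the "2 subsequent access" remark, and resetting the count when a scan finds the page not accessed is an assumed policy, not something the patch is claimed to do.

```python
from collections import defaultdict, deque


class HotPageTracker:
    """Tracks consecutive accessed scans per page and queues hot pages per mm."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        # (mm_id, pfn) -> number of consecutive scans that saw the page accessed
        self.hits = defaultdict(int)
        # mm_id -> PFNs waiting for the migration thread
        self.migrate_lists = defaultdict(deque)

    def report(self, mm_id, pfn, accessed):
        """Scanner side: report one observation. Returns True if newly queued."""
        key = (mm_id, pfn)
        if not accessed:
            # Assumed policy: an idle scan resets the streak.
            self.hits.pop(key, None)
            return False
        self.hits[key] += 1
        if self.hits[key] == self.threshold:
            self.migrate_lists[mm_id].append(pfn)
            return True
        return False

    def drain(self, mm_id, batch=32):
        """Migration-thread side: take up to `batch` queued PFNs for one mm."""
        queue = self.migrate_lists[mm_id]
        out = []
        while queue and len(out) < batch:
            pfn = queue.popleft()
            self.hits.pop((mm_id, pfn), None)  # start fresh after promotion
            out.append(pfn)
        return out
```

The same `report()` entry point could in principle be fed by other sources (IBS, LRU) instead of the PTE A bit scanner, which is the kind of common API mentioned in the list above.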
Thanks and Regards
- Raghu