Message-ID: <bffe178c-bd97-4945-898e-97ba203f503e@redhat.com>
Date: Thu, 1 Aug 2024 14:53:15 +0200
From: David Hildenbrand <david@...hat.com>
To: Zhangrenze <zhang.renze@....com>, "linux-mm@...ck.org"
<linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
Cc: "arnd@...db.de" <arnd@...db.de>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"chris@...kel.net" <chris@...kel.net>,
"jcmvbkbc@...il.com" <jcmvbkbc@...il.com>,
"James.Bottomley@...senPartnership.com"
<James.Bottomley@...senPartnership.com>, "deller@....de" <deller@....de>,
"linux-parisc@...r.kernel.org" <linux-parisc@...r.kernel.org>,
"tsbogend@...ha.franken.de" <tsbogend@...ha.franken.de>,
"rdunlap@...radead.org" <rdunlap@...radead.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"linux-mips@...r.kernel.org" <linux-mips@...r.kernel.org>,
"richard.henderson@...aro.org" <richard.henderson@...aro.org>,
"ink@...assic.park.msu.ru" <ink@...assic.park.msu.ru>,
"mattst88@...il.com" <mattst88@...il.com>,
"linux-alpha@...r.kernel.org" <linux-alpha@...r.kernel.org>,
Jiaoxupo <jiaoxupo@....com>, Zhouhaofan <zhou.haofan@....com>
Subject: Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE
On 01.08.24 11:57, Zhangrenze wrote:
>>> Sure, here's the Scalable Tiered Memory Control (STMC)
>>>
>>> **Background**
>>>
>>> In the era when artificial intelligence, big data analytics, and
>>> machine learning have become mainstream research topics and
>>> application scenarios, the demand for high-capacity and high-
>>> bandwidth memory in computers has become increasingly important.
>>> The emergence of CXL (Compute Express Link) provides the
>>> possibility of high-capacity memory. Although CXL TYPE3 devices
>>> can provide large memory capacities, their access speed is lower
>>> than traditional DRAM due to hardware architecture limitations.
>>>
>>> To enjoy the large capacity brought by CXL memory while minimizing
>>> the impact of high latency, Linux has introduced the Tiered Memory
>>> architecture. In the Tiered Memory architecture, CXL memory is
>>> treated as an independent, slower NUMA NODE, while DRAM is
>>> considered a relatively faster NUMA NODE. Applications allocate
>>> memory from the local node, and Tiered Memory, leveraging memory
>>> reclamation and NUMA Balancing mechanisms, can transparently demote
>>> physical pages not recently accessed by user processes to the slower
>>> CXL NUMA NODE. However, when user processes re-access the demoted
>>> memory, the Tiered Memory mechanism will, based on certain logic,
>>> decide whether to promote the demoted physical pages back to the
>>> fast NUMA NODE. If the promotion is successful, the memory accessed
>>> by the user process will reside in DRAM; otherwise, it will reside in
>>> the CXL NODE. Through the Tiered Memory mechanism, Linux balances
>>> between large memory capacity and latency, striving to maintain an
>>> equilibrium for applications.
>>>
>>> **Problem**
>>> Although Tiered Memory strives to balance between large capacity and
>>> latency, specific scenarios can lead to the following issues:
>>>
>>> 1. In scenarios requiring massive computations, if data is heavily
>>> stored in CXL slow memory and Tiered Memory cannot promptly
>>> promote this memory to fast DRAM, it will significantly impact
>>> program performance.
>>> 2. Similar to the scenario described in point 1, if Tiered Memory
>>> decides to promote these physical pages to the fast DRAM NODE, but
>>> due to limitations in the DRAM NODE promotion ratio, these physical
>>> pages cannot be promoted. Consequently, the program will keep
>>> running in slow memory.
>>> 3. After an application finishes computing on a large block of fast
>>> memory, it may not immediately re-access it. Hence, this memory
>>> can only wait for the memory reclamation mechanism to demote it.
>>> 4. Similar to the scenario described in point 3, if the demotion
>>> speed is slow, these cold pages will occupy the promotion
>>> resources, preventing some eligible slow pages from being
>>> immediately promoted, severely affecting application efficiency.
>>>
>>> **Solution**
>>> We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
>>> which delegates the authority of promoting and demoting memory to the
>>> application. The principle is simple, as follows:
>>>
>>> 1. When an application is preparing for computation, it can promote
>>> the memory it needs to use or ensure the memory resides on a fast
>>> NODE.
>>> 2. When an application will not use the memory in the near future,
>>> it can immediately demote the memory to slow memory, freeing up
>>> valuable promotion resources.
>>>
>>> The STMC mechanism is implemented through the madvise system call,
>>> providing two new advice options: MADV_DEMOTE and MADV_PROMOTE.
>>> MADV_DEMOTE advises demoting the physical memory to the node where
>>> slow memory resides; this advice only fails if there is no free
>>> physical memory on the slow memory node. MADV_PROMOTE advises
>>> retaining the physical memory in fast memory; this advice only fails
>>> if there are no promotion slots available on the fast memory node.
>>> Benefits brought by STMC include:
>>>
>>> 1. The STMC mechanism is a variant of on-demand memory management
>>> designed to let applications enjoy fast memory as much as possible,
>>> while actively demoting to slow memory when not in use, thus
>>> freeing up promotion slots for the NODE and allowing it to run in
>>> an optimized Tiered Memory environment.
>>> 2. The STMC mechanism better balances large capacity and latency.
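A minimal sketch of the usage pattern the cover letter describes, wrapping
a compute phase with the proposed advice values. It assumes a kernel and
uapi headers carrying this patch (MADV_PROMOTE/MADV_DEMOTE are not in
mainline), and the buffer handling is purely illustrative:

/*
 * Sketch only: assumes headers from this patch provide
 * MADV_PROMOTE / MADV_DEMOTE.
 */
#include <sys/mman.h>
#include <stdio.h>

static void compute_phase(void *buf, size_t len)
{
	/* Before the hot compute phase: ask the kernel to place the
	 * working set on the fast (DRAM) node. */
	if (madvise(buf, len, MADV_PROMOTE))
		perror("madvise(MADV_PROMOTE)");

	/* ... heavy computation on buf ... */

	/* Afterwards: demote immediately instead of waiting for
	 * reclaim, freeing promotion "slots" for other processes. */
	if (madvise(buf, len, MADV_DEMOTE))
		perror("madvise(MADV_DEMOTE)");
}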
>>>
>>> **Shortcomings of STMC**
>>> The STMC mechanism requires the caller to manage memory demotion and
>>> promotion. If the memory is not promptly demoted after a promotion,
>>> it may cause issues similar to memory leaks.
>> Ehm, that sounds scary. Can you elaborate what's happening here and why
>> it is "similar to memory leaks"?
>>
>>
>> Can you also point out why migrate_pages() is not suitable? I would
>> assume demote/promote is in essence simply migrating memory between nodes.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
> Thank you for the response. Below are my points of view. If there are any
> mistakes, I appreciate your understanding:
>
> 1. In a tiered memory system, fast nodes and slow nodes act as two common
> memory pools. The system has a certain ratio limit for promotion. For
> example, a NODE may stipulate that when the available memory is less
> than 1GB or 1/4 of the node's memory, promotion is prohibited. If we
> use migrate_pages at this point, it will unrestrictedly promote slow
> pages to fast memory, which may prevent other processes’ pages that
> should have been promoted from being promoted. This is what I mean by
> occupying promotion resources.
> 2. As described in point 1, if we use MADV_PROMOTE to temporarily promote
> a batch of pages and do not demote them immediately after usage, it
> will occupy many promotion resources. Other hot pages that need
> promotion will not be able to get promoted, which will impact the
> performance of certain processes.
So, you mean, applications can actively consume "fast memory" and
"steal" it from other applications? I assume that's what you meant with
"memory leak".
I would really suggest *not* calling this "similar to memory leaks", for
your own sake ;)
> 3. MADV_DEMOTE and MADV_PROMOTE only rely on madvise, while migrate_pages
> depends on libnuma.
Well, you can trivially call that system call also without libnuma ;) So
that shouldn't really make a difference and is rather something that can
be solved in user space.
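For instance, migrate_pages(2) can be invoked through syscall(2) directly;
a rough sketch, where the node numbers (0 = DRAM, 1 = CXL) are only an
assumption for illustration:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void)
{
	unsigned long old_nodes = 1UL << 1;	/* slow CXL node (node 1) */
	unsigned long new_nodes = 1UL << 0;	/* fast DRAM node (node 0) */
	unsigned long maxnode = 8 * sizeof(unsigned long);

	/* Move the calling process's pages from old_nodes to new_nodes. */
	long ret = syscall(SYS_migrate_pages, 0 /* self */, maxnode,
			   &old_nodes, &new_nodes);
	if (ret < 0)
		perror("migrate_pages");
	else
		printf("%ld page(s) could not be moved\n", ret);
	return 0;
}

move_pages(2) similarly works on individual page addresses if only a
specific range should be migrated.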
> 4. MADV_DEMOTE and MADV_PROMOTE provide a better balance between capacity
> and latency. They allow hot pages that need promoting to be promoted
> smoothly and pages that need demoting to be demoted immediately. This
> helps tiered memory systems to operate more rationally.
Can you summarize why something similar could not be provided by a
library that builds on existing functionality, such as migrate_pages?
It could easily take a look at memory stats to reason whether a
promotion/demotion makes sense (your example above with the memory
distribution).
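A rough sketch of such a user-space check, using the per-node meminfo
exposed in sysfs. The node number, the threshold policy (mirroring the
1GB / 1/4-of-node example above) and the helper names are assumptions:

#include <stdio.h>
#include <string.h>

/* Parse "Node <n> <field>: <kB> kB" from
 * /sys/devices/system/node/node<n>/meminfo. */
static long long node_meminfo_kb(int node, const char *field)
{
	char path[128], line[256], key[64];
	long long kb = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/meminfo", node);
	f = fopen(path, "r");
	if (!f)
		return -1;

	snprintf(key, sizeof(key), " %s:", field);
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, key);

		if (p && sscanf(p + strlen(key), "%lld", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/* Example policy: promotion makes sense only while the fast node keeps
 * more than max(1 GiB, 1/4 of the node) free. */
static int promotion_makes_sense(int fast_node)
{
	long long total = node_meminfo_kb(fast_node, "MemTotal");
	long long free_kb = node_meminfo_kb(fast_node, "MemFree");
	long long min_free = total / 4;

	if (min_free < 1024 * 1024)		/* 1 GiB in kB */
		min_free = 1024 * 1024;

	return total > 0 && free_kb > min_free;
}

A wrapper could then call migrate_pages()/move_pages() only when the check
passes, and otherwise defer or fall back.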
From the patch itself I read
"MADV_DEMOTE can mark a range of memory pages as cold
pages and immediately demote them to slow memory. MADV_PROMOTE can mark
a range of memory pages as hot pages and immediately promote them to
fast memory"
which sounds to me like migrate_pages / MADV_COLD might be able to
achieve something similar.
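For comparison, a rough sketch of what is already possible with MADV_COLD
(a hint that the range is cold, i.e. a reclaim/demotion candidate, rather
than a forced immediate demotion) plus move_pages(2). The node number is
an assumption, and addresses passed to move_pages() must be page-aligned:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_MF_MOVE */
#include <stdio.h>

#ifndef MADV_COLD
#define MADV_COLD 20		/* uapi value, for older libc headers */
#endif

/* "Demote-ish": hint that this range is cold. */
static void mark_cold(void *buf, size_t len)
{
	if (madvise(buf, len, MADV_COLD))
		perror("madvise(MADV_COLD)");
}

/* "Promote-ish": explicitly move one page-aligned page to node 0. */
static void move_to_fast_node(void *page_addr)
{
	void *pages[1] = { page_addr };
	int nodes[1] = { 0 };		/* assumed DRAM node */
	int status[1];
	long ret = syscall(SYS_move_pages, 0 /* self */, 1UL, pages,
			   nodes, status, MPOL_MF_MOVE);

	if (ret < 0)
		perror("move_pages");
	else if (ret > 0)
		fprintf(stderr, "%ld page(s) not moved\n", ret);
}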
What's the biggest difference that MADV_DEMOTE|MADV_PROMOTE can do better?
--
Cheers,
David / dhildenb