Message-ID: <bffe178c-bd97-4945-898e-97ba203f503e@redhat.com>
Date: Thu, 1 Aug 2024 14:53:15 +0200
From: David Hildenbrand <david@...hat.com>
To: Zhangrenze <zhang.renze@....com>, "linux-mm@...ck.org"
<linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
Cc: "arnd@...db.de" <arnd@...db.de>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"chris@...kel.net" <chris@...kel.net>,
"jcmvbkbc@...il.com" <jcmvbkbc@...il.com>,
"James.Bottomley@...senPartnership.com"
<James.Bottomley@...senPartnership.com>, "deller@....de" <deller@....de>,
"linux-parisc@...r.kernel.org" <linux-parisc@...r.kernel.org>,
"tsbogend@...ha.franken.de" <tsbogend@...ha.franken.de>,
"rdunlap@...radead.org" <rdunlap@...radead.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"linux-mips@...r.kernel.org" <linux-mips@...r.kernel.org>,
"richard.henderson@...aro.org" <richard.henderson@...aro.org>,
"ink@...assic.park.msu.ru" <ink@...assic.park.msu.ru>,
"mattst88@...il.com" <mattst88@...il.com>,
"linux-alpha@...r.kernel.org" <linux-alpha@...r.kernel.org>,
Jiaoxupo <jiaoxupo@....com>, Zhouhaofan <zhou.haofan@....com>
Subject: Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE
On 01.08.24 11:57, Zhangrenze wrote:
>>> Sure, here's the Scalable Tiered Memory Control (STMC)
>>>
>>> **Background**
>>>
>>> In the era when artificial intelligence, big data analytics, and
>>> machine learning have become mainstream research topics and
>>> application scenarios, the demand for high-capacity and high-
>>> bandwidth memory in computers has become increasingly important.
>>> The emergence of CXL (Compute Express Link) provides the
>>> possibility of high-capacity memory. Although CXL TYPE3 devices
>>> can provide large memory capacities, their access speed is lower
>>> than traditional DRAM due to hardware architecture limitations.
>>>
>>> To enjoy the large capacity brought by CXL memory while minimizing
>>> the impact of high latency, Linux has introduced the Tiered Memory
>>> architecture. In the Tiered Memory architecture, CXL memory is
>>> treated as an independent, slower NUMA NODE, while DRAM is
>>> considered a relatively faster NUMA NODE. Applications allocate
>>> memory from the local node, and Tiered Memory, leveraging memory
>>> reclamation and NUMA Balancing mechanisms, can transparently demote
>>> physical pages not recently accessed by user processes to the slower
>>> CXL NUMA NODE. However, when user processes re-access the demoted
>>> memory, the Tiered Memory mechanism will, based on certain logic,
>>> decide whether to promote the demoted physical pages back to the
>>> fast NUMA NODE. If the promotion is successful, the memory accessed
>>> by the user process will reside in DRAM; otherwise, it will reside in
>>> the CXL NODE. Through the Tiered Memory mechanism, Linux balances
>>> between large memory capacity and latency, striving to maintain an
>>> equilibrium for applications.
>>>
>>> **Problem**
>>> Although Tiered Memory strives to balance between large capacity and
>>> latency, specific scenarios can lead to the following issues:
>>>
>>> 1. In scenarios requiring massive computations, if data is heavily
>>> stored in CXL slow memory and Tiered Memory cannot promptly
>>> promote this memory to fast DRAM, it will significantly impact
>>> program performance.
>>> 2. Similar to the scenario described in point 1, if Tiered Memory
>>> decides to promote these physical pages to the fast DRAM NODE, but
>>> due to limitations in the DRAM NODE promotion ratio, these physical
>>> pages cannot be promoted. Consequently, the program will keep
>>> running in slow memory.
>>> 3. After an application finishes computing on a large block of fast
>>> memory, it may not immediately re-access it. Hence, this memory
>>> can only wait for the memory reclamation mechanism to demote it.
>>> 4. Similar to the scenario described in point 3, if the demotion
>>> speed is slow, these cold pages will occupy the promotion
>>> resources, preventing some eligible slow pages from being
>>> immediately promoted, severely affecting application efficiency.
>>>
>>> **Solution**
>>> We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
>>> which delegates the authority of promoting and demoting memory to the
>>> application. The principle is simple, as follows:
>>>
>>> 1. When an application is preparing for computation, it can promote
>>> the memory it needs to use or ensure the memory resides on a fast
>>> NODE.
>>> 2. When an application will not use the memory in the near future,
>>> it can immediately demote the memory to slow memory, freeing up
>>> valuable promotion resources.
>>>
>>> The STMC mechanism is implemented through the madvise system call,
>>> providing two new advice options: MADV_DEMOTE and MADV_PROMOTE.
>>> MADV_DEMOTE advises demoting the physical memory to the node where
>>> slow memory resides; this advice only fails if there is no free
>>> physical memory on the slow memory node. MADV_PROMOTE advises
>>> retaining the physical memory in fast memory; this advice only fails
>>> if there are no promotion slots available on the fast memory node.
>>> Benefits brought by STMC include:
>>>
>>> 1. The STMC mechanism is a variant of on-demand memory management
>>> designed to let applications enjoy fast memory as much as possible,
>>> while actively demoting to slow memory when not in use, thus
>>> freeing up promotion slots for the NODE and allowing it to run in
>>> an optimized Tiered Memory environment.
>>> 2. The STMC mechanism better balances large capacity and latency.
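A minimal sketch of the usage pattern the cover letter describes, wrapping
a compute phase with the proposed advice values. It assumes a kernel and
uapi headers carrying this patch (MADV_PROMOTE/MADV_DEMOTE are not in
mainline), and the buffer handling is purely illustrative:

/*
 * Sketch only: assumes headers from this patch provide
 * MADV_PROMOTE / MADV_DEMOTE.
 */
#include <sys/mman.h>
#include <stdio.h>

static void compute_phase(void *buf, size_t len)
{
	/* Before the hot compute phase: ask the kernel to place the
	 * working set on the fast (DRAM) node. */
	if (madvise(buf, len, MADV_PROMOTE))
		perror("madvise(MADV_PROMOTE)");

	/* ... heavy computation on buf ... */

	/* Afterwards: demote immediately instead of waiting for
	 * reclaim, freeing promotion "slots" for other processes. */
	if (madvise(buf, len, MADV_DEMOTE))
		perror("madvise(MADV_DEMOTE)");
}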
>>>
>>> **Shortcomings of STMC**
>>> The STMC mechanism requires the caller to manage memory demotion and
>>> promotion. If the memory is not promptly demoted after a promotion,
>>> it may cause issues similar to memory leaks.
>> Ehm, that sounds scary. Can you elaborate what's happening here and why
>> it is "similar to memory leaks"?
>>
>>
>> Can you also point out why migrate_pages() is not suitable? I would
>> assume demote/promote is in essence simply migrating memory between nodes.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
> Thank you for the response. Below are my points of view. If there are any
> mistakes, I appreciate your understanding:
>
> 1. In a tiered memory system, fast nodes and slow nodes act as two common
> memory pools. The system has a certain ratio limit for promotion. For
> example, a NODE may stipulate that when the available memory is less
> than 1GB or 1/4 of the node's memory, promotion is prohibited. If we
> use migrate_pages at this point, it will unrestrictedly promote slow
> pages to fast memory, which may prevent other processes’ pages that
> should have been promoted from being promoted. This is what I mean by
> occupying promotion resources.
> 2. As described in point 1, if we use MADV_PROMOTE to temporarily promote
> a batch of pages and do not demote them immediately after usage, it
> will occupy many promotion resources. Other hot pages that need
> promotion will not be able to get promoted, which will impact the
> performance of certain processes.
So, you mean, applications can actively consume "fast memory" and
"steal" it from other applications? I assume that's what you meant with
"memory leak".
I would really suggest *not* calling this "similar to memory leaks", for
your own sake ;)
> 3. MADV_DEMOTE and MADV_PROMOTE only rely on madvise, while migrate_pages
> depends on libnuma.
Well, you can trivially call that system call also without libnuma ;) So
that shouldn't really make a difference and is rather something that can
be solved in user space.
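For instance, migrate_pages(2) can be invoked through syscall(2) directly;
a rough sketch, where the node numbers (0 = DRAM, 1 = CXL) are only an
assumption for illustration:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void)
{
	unsigned long old_nodes = 1UL << 1;	/* slow CXL node (node 1) */
	unsigned long new_nodes = 1UL << 0;	/* fast DRAM node (node 0) */
	unsigned long maxnode = 8 * sizeof(unsigned long);

	/* Move the calling process's pages from old_nodes to new_nodes. */
	long ret = syscall(SYS_migrate_pages, 0 /* self */, maxnode,
			   &old_nodes, &new_nodes);
	if (ret < 0)
		perror("migrate_pages");
	else
		printf("%ld page(s) could not be moved\n", ret);
	return 0;
}

move_pages(2) similarly works on individual page addresses if only a
specific range should be migrated.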
> 4. MADV_DEMOTE and MADV_PROMOTE provide a better balance between capacity
> and latency. They allow hot pages that need promoting to be promoted
> smoothly and pages that need demoting to be demoted immediately. This
> helps tiered memory systems to operate more rationally.
Can you summarize why something similar could not be provided by a
library that builds on existing functionality, such as migrate_pages?
It could easily take a look at memory stats to reason whether a
promotion/demotion makes sense (your example above with the memory
distribution).
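A rough sketch of such a user-space check, using the per-node meminfo
exposed in sysfs. The node number, the threshold policy (mirroring the
1GB / 1/4-of-node example above) and the helper names are assumptions:

#include <stdio.h>
#include <string.h>

/* Parse "Node <n> <field>: <kB> kB" from
 * /sys/devices/system/node/node<n>/meminfo. */
static long long node_meminfo_kb(int node, const char *field)
{
	char path[128], line[256], key[64];
	long long kb = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/meminfo", node);
	f = fopen(path, "r");
	if (!f)
		return -1;

	snprintf(key, sizeof(key), " %s:", field);
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, key);

		if (p && sscanf(p + strlen(key), "%lld", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/* Example policy: promotion makes sense only while the fast node keeps
 * more than max(1 GiB, 1/4 of the node) free. */
static int promotion_makes_sense(int fast_node)
{
	long long total = node_meminfo_kb(fast_node, "MemTotal");
	long long free_kb = node_meminfo_kb(fast_node, "MemFree");
	long long min_free = total / 4;

	if (min_free < 1024 * 1024)		/* 1 GiB in kB */
		min_free = 1024 * 1024;

	return total > 0 && free_kb > min_free;
}

A wrapper could then call migrate_pages()/move_pages() only when the check
passes, and otherwise defer or fall back.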
From the patch itself I read
"MADV_DEMOTE can mark a range of memory pages as cold
pages and immediately demote them to slow memory. MADV_PROMOTE can mark
a range of memory pages as hot pages and immediately promote them to
fast memory"
which sounds to me like migrate_pages / MADV_COLD might be able to
achieve something similar.
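For comparison, a rough sketch of what is already possible with MADV_COLD
(a hint that the range is cold, i.e. a reclaim/demotion candidate, rather
than a forced immediate demotion) plus move_pages(2). The node number is
an assumption, and addresses passed to move_pages() must be page-aligned:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_MF_MOVE */
#include <stdio.h>

#ifndef MADV_COLD
#define MADV_COLD 20		/* uapi value, for older libc headers */
#endif

/* "Demote-ish": hint that this range is cold. */
static void mark_cold(void *buf, size_t len)
{
	if (madvise(buf, len, MADV_COLD))
		perror("madvise(MADV_COLD)");
}

/* "Promote-ish": explicitly move one page-aligned page to node 0. */
static void move_to_fast_node(void *page_addr)
{
	void *pages[1] = { page_addr };
	int nodes[1] = { 0 };		/* assumed DRAM node */
	int status[1];
	long ret = syscall(SYS_move_pages, 0 /* self */, 1UL, pages,
			   nodes, status, MPOL_MF_MOVE);

	if (ret < 0)
		perror("move_pages");
	else if (ret > 0)
		fprintf(stderr, "%ld page(s) not moved\n", ret);
}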
What's the biggest difference that MADV_DEMOTE|MADV_PROMOTE can do better?
--
Cheers,
David / dhildenb