linux-kernel - Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM


Message-ID: <87o8tvglii.fsf@yhuang-dev.intel.com>
Date:   Wed, 19 Feb 2020 14:05:09 +0800
From:   "Huang\, Ying" <ying.huang@...el.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>, Feng Tang <feng.tang@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Michal Hocko" <mhocko@...e.com>, Rik van Riel <riel@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Dan Williams <dan.j.williams@...el.com>
Subject: Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM

Mel Gorman <mgorman@...e.de> writes:

> On Tue, Feb 18, 2020 at 04:26:29PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@...el.com>
>> 
>> In a memory tiering system, if the memory size of the workloads is
>> smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
>> the workloads should be put in the faster memory nodes.  But this
>> makes it unnecessary to use slower memory (e.g. PMEM) at all.
>> 
>> So in common cases, the memory size of the workload should be larger
>> than that of the faster memory nodes.  And to optimize the
>> performance, the hot pages should be promoted to the faster memory
>> nodes while the cold pages should be demoted to the slower memory
>> nodes.  To achieve that, we have two choices,
>> 
>> a. Promote the hot pages from the slower memory node to the faster
>>    memory node.  This will create some memory pressure in the faster
>>    memory node, thus trigger the memory reclaiming, where the cold
>>    pages will be demoted to the slower memory node.
>> 
>> b. Demote the cold pages from faster memory node to the slower memory
>>    node.  This will create some free memory space in the faster memory
>>    node, and the hot pages in the slower memory node could be promoted
>>    to the faster memory node.
>> 
>> The choice "a" will create the memory pressure in the faster memory
>> node.  If the memory pressure of the workload is high too, the memory
>> pressure may become so high that the memory allocation latency of the
>> workload is influenced, e.g. the direct reclaiming may be triggered.
>> 
>> The choice "b" works much better in this respect.  If the memory
>> pressure of the workload is high, it will consume the free memory and
>> the hot pages promotion will stop earlier if its allocation watermark
>> is higher than that of the normal memory allocation.
>> 
>> In this patch, choice "b" is implemented.  If memory tiering NUMA
>> balancing mode is enabled, the node isn't the slowest node, and the
>> free memory size of the node is below the high watermark, the kswapd
>> of the node will be woken up to free some memory until the free memory
>> size is above the high watermark + autonuma promotion rate limit.  If
>> the free memory size is below the high watermark, autonuma promotion
>> will stop working.  This avoids creating too much memory pressure on
>> the system.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
>
> Unfortunately I stopped reading at this point. It depends on another series
> entirely and they really need to be presented together instead of relying
> on searching mail archives to find other patches to try assemble the full
> picture :(. Ideally each stage would have supporting data showing roughly
> how it behaves at each major stage. I know this will be a pain but the
> original NUMA balancing had the same problem and ultimately started with
> one large series that got the basics right followed by other series that
> improved it in stages. That process is *still* ongoing today.

Sorry for the inconvenience.  We will post a new patchset that includes
both series, with supporting data for each major stage where possible.

Best Regards,
Huang, Ying
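
For readers following the thread, the wakeup and promotion conditions in
the quoted patch description can be modelled roughly as below.  This is
only a user-space sketch of the three conditions as worded in the
description, not code from the patch: the struct, its fields, and the
function names are all hypothetical, and the real kernel watermark checks
are more involved.

```c
#include <stdbool.h>

/* Hypothetical model of one NUMA node; names are illustrative only. */
struct node_model {
	unsigned long free_pages;       /* current free memory, in pages */
	unsigned long high_wmark;       /* high watermark, in pages */
	unsigned long promo_rate_limit; /* promotion rate limit, in pages */
	bool is_slowest;                /* true for the slowest (e.g. PMEM) node */
};

/*
 * kswapd is woken up when tiering mode is on, the node is not the
 * slowest one, and free memory is below the high watermark.
 */
bool should_wake_kswapd(const struct node_model *n, bool tiering_enabled)
{
	return tiering_enabled && !n->is_slowest &&
	       n->free_pages < n->high_wmark;
}

/*
 * kswapd keeps demoting cold pages until free memory is above the
 * high watermark plus the promotion rate limit.
 */
bool kswapd_done(const struct node_model *n)
{
	return n->free_pages > n->high_wmark + n->promo_rate_limit;
}

/* Promotion into this node stops while free memory is below the high watermark. */
bool promotion_allowed(const struct node_model *n)
{
	return n->free_pages >= n->high_wmark;
}
```

With a high watermark of 100 pages and a rate limit of 20 pages, a node
with 90 free pages would wake kswapd and refuse promotions; at 110 free
pages promotions are allowed again while kswapd keeps going until more
than 120 pages are free.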