linux-kernel - Re: RFC: Memory Tiering Kernel Interfaces


Message-ID: <87f8d4d0-6d06-7254-b2a6-3ccf6a555733@linux.alibaba.com>
Date:   Tue, 3 May 2022 10:07:08 +0800
From:   Baolin Wang <baolin.wang@...ux.alibaba.com>
To:     Wei Xu <weixugc@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Huang Ying <ying.huang@...el.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Yang Shi <shy828301@...il.com>, Linux MM <linux-mm@...ck.org>,
        Greg Thelen <gthelen@...gle.com>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Jagdish Gediya <jvgediya@...ux.ibm.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Alistair Popple <apopple@...dia.com>,
        Michal Hocko <mhocko@...nel.org>,
        Brice Goglin <brice.goglin@...il.com>,
        Feng Tang <feng.tang@...el.com>, Jonathan.Cameron@...wei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces



On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
> 
> On Fri, 29 Apr 2022, Wei Xu wrote:
> 
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
> 
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
> 
>> A tiering relationship between NUMA nodes in the form of demotion path
>> is created during the kernel initialization and updated when a NUMA
>> node is hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>> tier-by-tier by establishing the per-node demotion targets based on
>> the distances between nodes.
>>
>> The current memory tiering interface needs to be improved to address
>> several important use cases:
>>
>> * The current tiering initialization code always initializes
>>  each memory-only NUMA node into a lower tier.  But a memory-only
>>  NUMA node may have a high performance memory device (e.g. a DRAM
>>  device attached via CXL.mem or a DRAM-backed memory-only node on
>>  a virtual machine) and should be put into the top tier.
> 
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
> 
>> Tiering Hierarchy Initialization
>> ================================
>>
>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>> A device driver can remove its memory nodes from the top tier, e.g.
>> a dax driver can remove PMEM nodes from the top tier.
>>
>> The kernel builds the memory tiering hierarchy and per-node demotion
>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> best distance nodes in the next lower tier are assigned to
>> node_demotion[N].preferred and all the nodes in the next lower tier
>> are assigned to node_demotion[N].allowed.
>>
>> node_demotion[N].preferred can be empty if no preferred demotion node
>> is available for node N.
> 
> In cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I've tried using round robin [1] to select the target demotion node when
there are multiple demotion nodes, but I did not see any obvious
performance gain with MySQL testing. Maybe we should use other test suites?

[1] https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com/
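
To make the selection policy discussed above concrete, here is a minimal
userspace sketch in Python. It is not kernel code and not the patch in [1].
It only models the rule quoted from the RFC (the best-distance nodes in the
next lower tier become "preferred", all nodes in that tier become "allowed")
and a round-robin pick among equal-cost preferred nodes. The names
build_demotion_targets, RoundRobinPicker and DISTANCE, and the distance
values themselves, are hypothetical and chosen only for illustration.

```python
# Userspace model only; every name and number here is an assumption.

# Hypothetical node distances: nodes 0 and 1 are top tier, 2 and 3 are
# in the next lower tier. DISTANCE[src][dst] is the cost from src to dst.
DISTANCE = {
    0: {2: 20, 3: 20},  # node 0 is equally far from both lower-tier nodes
    1: {2: 30, 3: 20},  # node 1 is closer to node 3
}


def build_demotion_targets(node, lower_tier, distance):
    """Return (preferred, allowed) demotion targets for `node`.

    allowed   = every node in the next lower tier
    preferred = the subset of allowed at the minimum distance from `node`
    Both lists are empty when there is no lower tier.
    """
    allowed = sorted(lower_tier)
    if not allowed:
        return [], []
    best = min(distance[node][target] for target in allowed)
    preferred = [t for t in allowed if distance[node][t] == best]
    return preferred, allowed


class RoundRobinPicker:
    """Cycle through the preferred targets of each source node in turn."""

    def __init__(self):
        self._next_index = {}  # per-source-node cursor into `preferred`

    def pick(self, node, preferred):
        if not preferred:
            return None  # no preferred demotion node is available
        index = self._next_index.get(node, 0) % len(preferred)
        self._next_index[node] = index + 1
        return preferred[index]


if __name__ == "__main__":
    picker = RoundRobinPicker()
    for src in (0, 1):
        preferred, allowed = build_demotion_targets(src, {2, 3}, DISTANCE)
        picks = [picker.pick(src, preferred) for _ in range(4)]
        print(src, preferred, allowed, picks)
```

With these made-up distances, node 0 has two equal-cost preferred targets
and alternates between them, while node 1 has a single preferred target, so
round robin and random selection behave identically for it. That may be one
reason a round-robin policy shows little difference on small topologies.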