Date:   Wed, 19 Jul 2017 17:48:58 +0800
From:   Bob Liu <liubo95@...wei.com>
To:     Ross Zwisler <ross.zwisler@...ux.intel.com>,
        <linux-kernel@...r.kernel.org>
CC:     "Anaczkowski, Lukasz" <lukasz.anaczkowski@...el.com>,
        "Box, David E" <david.e.box@...el.com>,
        "Kogut, Jaroslaw" <Jaroslaw.Kogut@...el.com>,
        "Lahtinen, Joonas" <joonas.lahtinen@...el.com>,
        "Moore, Robert" <robert.moore@...el.com>,
        "Nachimuthu, Murugasamy" <murugasamy.nachimuthu@...el.com>,
        "Odzioba, Lukasz" <lukasz.odzioba@...el.com>,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        "Schmauss, Erik" <erik.schmauss@...el.com>,
        "Verma, Vishal L" <vishal.l.verma@...el.com>,
        "Zheng, Lv" <lv.zheng@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dan Williams <dan.j.williams@...el.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Jerome Glisse <jglisse@...hat.com>,
        Len Brown <lenb@...nel.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, <devel@...ica.org>,
        <linux-acpi@...r.kernel.org>, <linux-mm@...ck.org>,
        <linux-nvdimm@...ts.01.org>
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 2017/7/7 5:52, Ross Zwisler wrote:
> ==== Quick Summary ====
> 
> Platforms in the very near future will have multiple types of memory
> attached to a single CPU.  These disparate memory ranges will have some
> characteristics in common, such as CPU cache coherence, but they can have
> wide ranges of performance both in terms of latency and bandwidth.
> 
> For example, consider a system that contains persistent memory, standard
> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> There could potentially be an order of magnitude or more difference in
> performance between the slowest and fastest memory attached to that CPU.
> 
> With the current Linux code, NUMA nodes are CPU-centric, so all the memory
> attached to a given CPU will be lumped into the same NUMA node.  This makes
> it very difficult for userspace applications to understand the performance
> of different memory ranges on a given CPU.
> 
> We solve this issue by providing userspace with performance information on
> individual memory ranges.  This performance information is exposed via
> sysfs:
> 
>   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>   mem_tgt2/firmware_id:1
>   mem_tgt2/is_cached:0
>   mem_tgt2/is_enabled:1
>   mem_tgt2/is_isolated:0
>   mem_tgt2/phys_addr_base:0x0
>   mem_tgt2/phys_length_bytes:0x800000000
>   mem_tgt2/local_init/read_bw_MBps:30720
>   mem_tgt2/local_init/read_lat_nsec:100
>   mem_tgt2/local_init/write_bw_MBps:30720
>   mem_tgt2/local_init/write_lat_nsec:100
> 
> This allows applications to easily find the memory that they want to use.
> We expect that the existing NUMA APIs will be enhanced to use this new
> information so that applications can continue to use them to select their
> desired memory.
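> 
> (As a purely illustrative sketch of how an application might consume these
> attributes: the parent directory used below is hypothetical, since the grep
> output above only shows paths relative to the mem_tgt2 directory, but the
> attribute names and units are the ones listed.)
> 
>   /*
>    * Sketch: read the advertised read latency/bandwidth of one target.
>    * The /sys/devices/system/hmem base path is an assumption made for
>    * illustration only.
>    */
>   #include <stdio.h>
> 
>   static long read_attr(const char *path)
>   {
>           FILE *f = fopen(path, "r");
>           long val = -1;
> 
>           if (f) {
>                   if (fscanf(f, "%ld", &val) != 1)
>                           val = -1;
>                   fclose(f);
>           }
>           return val;
>   }
> 
>   int main(void)
>   {
>           const char *base = "/sys/devices/system/hmem/mem_tgt2/local_init";
>           char path[256];
>           long lat, bw;
> 
>           snprintf(path, sizeof(path), "%s/read_lat_nsec", base);
>           lat = read_attr(path);
>           snprintf(path, sizeof(path), "%s/read_bw_MBps", base);
>           bw = read_attr(path);
> 
>           printf("read latency: %ld nsec, read bandwidth: %ld MB/s\n",
>                  lat, bw);
>           return (lat < 0 || bw < 0);
>   }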
> 
> This series is built upon acpica-1705:
> 
> https://github.com/zetalog/linux/commits/acpica-1705
> 
> And you can find a working tree here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
> 
> ==== Lots of Details ====
> 
> This patch set is only concerned with CPU-addressable memory types, not
> on-device memory like what we have with Jerome Glisse's HMM series:
> 
> https://lwn.net/Articles/726691/
> 
> This patch set works by enabling the new Heterogeneous Memory Attribute
> Table (HMAT), defined in ACPI 6.2.  One major conceptual change
> in ACPI 6.2 related to this work is that proximity domains no longer need
> to contain a processor.  We can now have memory-only proximity domains,
> which means that we can now have memory-only Linux NUMA nodes.
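> 
> (To make "memory-only NUMA node" concrete, here is a rough userspace sketch,
> using only the existing libnuma API rather than anything added by this
> series, that lists the nodes which report memory but no CPUs and allocates a
> buffer from the first one it finds.)
> 
>   /*
>    * Sketch: enumerate memory-only NUMA nodes with libnuma and allocate
>    * 1 MiB from each one found.  Build with: gcc ... -lnuma
>    */
>   #include <numa.h>
>   #include <stdio.h>
> 
>   int main(void)
>   {
>           struct bitmask *cpus;
>           int node, max_node;
> 
>           if (numa_available() < 0) {
>                   fprintf(stderr, "no NUMA support\n");
>                   return 1;
>           }
> 
>           max_node = numa_max_node();
>           cpus = numa_allocate_cpumask();
> 
>           for (node = 0; node <= max_node; node++) {
>                   long long size = numa_node_size64(node, NULL);
> 
>                   if (size <= 0)
>                           continue;               /* no memory here */
>                   if (numa_node_to_cpus(node, cpus) < 0)
>                           continue;
>                   if (numa_bitmask_weight(cpus) == 0) {
>                           void *buf = numa_alloc_onnode(1 << 20, node);
> 
>                           printf("memory-only node %d (%lld bytes)\n",
>                                  node, size);
>                           if (buf)
>                                   numa_free(buf, 1 << 20);
>                   }
>           }
>           numa_free_cpumask(cpus);
>           return 0;
>   }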
> 
> Here is an example configuration where we have a single processor, one
> range of regular memory and one range of HBM:
> 
>   +---------------+   +----------------+
>   | Processor     |   | Memory         |
>   | prox domain 0 +---+ prox domain 1  |
>   | NUMA node 1   |   | NUMA node 2    |
>   +-------+-------+   +----------------+
>           |
>   +-------+----------+
>   | HBM              |
>   | prox domain 2    |
>   | NUMA node 0      |
>   +------------------+
> 
> This gives us one initiator (the processor) and two targets (the two memory
> ranges).  Each of these three has its own ACPI proximity domain and
> associated Linux NUMA node.  Note also that while there is a 1:1 mapping
> from each proximity domain to each NUMA node, the numbers don't necessarily
> match up.  Additionally we can have extra NUMA nodes that don't map back to
> ACPI proximity domains.
> 
> The above configuration could also have the processor and one of the two
> memory ranges sharing a proximity domain and NUMA node, but for the
> purposes of the HMAT the two memory ranges will always need to be
> separated.
> 
> The overall goal of this series and of the HMAT is to allow users to
> identify memory using its performance characteristics.  This can broadly be
> done in one of two ways:
> 
> Option 1: Provide the user with a way to map between proximity domains and
> NUMA nodes and a way to access the HMAT directly (probably via
> /sys/firmware/acpi/tables).  Then, through possibly a library and a daemon,
> provide an API so that applications can either request information about
> memory ranges, or request memory allocations that meet a given set of
> performance characteristics.
> 
> Option 2: Provide the user with HMAT performance data directly in sysfs,
> allowing applications to directly access it without the need for the
> library and daemon.
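> 
> (For reference, the raw table in option 1 is typically exposed as
> /sys/firmware/acpi/tables/HMAT on systems whose firmware provides an HMAT;
> reading it normally requires root.  The sketch below just dumps the standard
> 36-byte ACPI table header as a starting point for a userspace decoder.)
> 
>   /* Sketch: print the ACPI header of the HMAT, if present. */
>   #include <stdio.h>
>   #include <stdint.h>
> 
>   struct acpi_table_header {
>           char     signature[4];
>           uint32_t length;
>           uint8_t  revision;
>           uint8_t  checksum;
>           char     oem_id[6];
>           char     oem_table_id[8];
>           uint32_t oem_revision;
>           uint32_t creator_id;
>           uint32_t creator_revision;
>   } __attribute__((packed));
> 
>   int main(void)
>   {
>           struct acpi_table_header hdr;
>           FILE *f = fopen("/sys/firmware/acpi/tables/HMAT", "rb");
> 
>           if (!f) {
>                   perror("HMAT");
>                   return 1;
>           }
>           if (fread(&hdr, sizeof(hdr), 1, f) != 1) {
>                   fprintf(stderr, "short read\n");
>                   fclose(f);
>                   return 1;
>           }
>           fclose(f);
> 
>           printf("%.4s rev %u, %u bytes total\n",
>                  hdr.signature, (unsigned)hdr.revision, (unsigned)hdr.length);
>           return 0;
>   }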
> 

Is it possible for the kernel to do the memory allocation automatically and transparently to users?
It seems unreasonable to expect most users to be aware of this detailed memory topology.

--
Thanks,
Bob Liu

> The kernel work for option 1 is started by patches 1-3.  These just surface
> the minimal amount of information in sysfs to allow userspace to map
> between proximity domains and NUMA nodes so that the raw data in the HMAT
> table can be understood.
