Message-ID: <42006749-7912-1e97-8ccd-945e82cebdde@intel.com>
Date: Tue, 4 Dec 2018 17:06:49 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Jerome Glisse <jglisse@...hat.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org,
"Rafael J . Wysocki" <rafael@...nel.org>,
Matthew Wilcox <willy@...radead.org>,
Ross Zwisler <ross.zwisler@...ux.intel.com>,
Keith Busch <keith.busch@...el.com>,
Dan Williams <dan.j.williams@...el.com>,
Haggai Eran <haggaie@...lanox.com>,
Balbir Singh <bsingharora@...il.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.ibm.com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Felix Kuehling <felix.kuehling@....com>,
Philip Yang <Philip.Yang@....com>,
Christian König <christian.koenig@....com>,
Paul Blinzer <Paul.Blinzer@....com>,
Logan Gunthorpe <logang@...tatee.com>,
John Hubbard <jhubbard@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>,
Michal Hocko <mhocko@...nel.org>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
Mark Hairgrove <mhairgrove@...dia.com>,
Vivek Kini <vkini@...dia.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Dave Airlie <airlied@...hat.com>,
Ben Skeggs <bskeggs@...hat.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Rik van Riel <riel@...riel.com>,
Ben Woodard <woodard@...hat.com>, linux-acpi@...r.kernel.org
Subject: Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
>
> I definitely want to avoid the memoryX mistake, so I do not want to
> see one link directory per device. Taking my simple laptop as an
> example, with 4 CPUs, a wifi device and two GPUs (the integrated
> one and a discrete one):
>
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 PCIe lanes)
> link2: gpu0 (unknown number of lanes, but I believe it has higher
> bandwidth to main memory)
> link3: gpu1 (16 PCIe lanes)
> link4: gpu1 and GPU memory
>
> So there is one link directory per PCIe lane count your devices
> have, so that you can differentiate on bandwidth. Main memory is
> symlinked inside every link directory except link4. The discrete
> GPU memory is only in the link4 directory, as it is only accessible
> by the GPU (we could add it under link3 too, with the
> non-cache-coherent property attached to it).
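
Just to make sure I'm reading that right, the tree for that laptop
would look something like this (directory and device names are my
guesses, not taken from the patches):

	link0/ -> cpu0  cpu1  cpu2  cpu3  main-memory
	link1/ -> wifi  main-memory
	link2/ -> gpu0  main-memory
	link3/ -> gpu1  main-memory
	link4/ -> gpu1  gpu1-memory

i.e. main memory shows up in every link directory except link4, and
the discrete GPU's memory shows up only under link4.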
I'm actually really interested in how this proposal scales. It's quite
easy to represent a laptop, but can it scale to the largest systems we
expect to encounter over the 20 years this ABI will live?
> The issue then becomes how to convert the overly verbose HMAT
> information down into some reasonable layout for HMS. For that I
> would say we create a link directory for each distinct matrix cell.
> As an example, say each entry in the matrix has a bandwidth and a
> latency; then we create a link directory for each combination of
> bandwidth and latency. On a simple system that should boil down to
> a handful of combinations, roughly mirroring the example above of
> one link directory per PCIe lane count.
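
If I understand the culling you're describing, it is roughly the
following (my own C sketch, all names and types are made up, not the
real HMAT parsing code):

	#include <stdint.h>

	#define MAX_LINKS 64	/* arbitrary, just for the sketch */

	struct hms_link {
		uint32_t bandwidth;	/* from the HMAT matrix cell */
		uint32_t latency;	/* from the HMAT matrix cell */
	};

	static struct hms_link links[MAX_LINKS];
	static int nr_links;

	/*
	 * Map a matrix cell to a link directory: reuse an existing
	 * link with the same (bandwidth, latency) pair, otherwise
	 * create a new one.
	 */
	static int link_for(uint32_t bandwidth, uint32_t latency)
	{
		int i;

		for (i = 0; i < nr_links; i++)
			if (links[i].bandwidth == bandwidth &&
			    links[i].latency == latency)
				return i;

		if (nr_links >= MAX_LINKS)
			return -1;	/* sketch: no overflow handling */

		links[nr_links].bandwidth = bandwidth;
		links[nr_links].latency = latency;
		return nr_links++;
	}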
OK, but there are 1024*1024 matrix cells on a system with 1024
proximity domains (the ACPI term for a NUMA node). So it sounds like
you are proposing a million-directory approach.
We also can't simply say that two CPUs with the same connection to two
other CPUs (think a 4-socket QPI-connected system) share the same "link"
just because they share the same combination of bandwidth and latency.
We need to know that *each* has its own, unique link and does not share
link resources.
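
To make that concrete, a hypothetical fully-connected 4-socket box has
six physically distinct socket-to-socket links:

	socket0 <-> socket1   (bandwidth X, latency Y)
	socket0 <-> socket2   (bandwidth X, latency Y)
	socket0 <-> socket3   (bandwidth X, latency Y)
	socket1 <-> socket2   (bandwidth X, latency Y)
	socket1 <-> socket3   (bandwidth X, latency Y)
	socket2 <-> socket3   (bandwidth X, latency Y)

All six share the same (bandwidth, latency) pair, but traffic on the
socket0<->socket1 link does not consume any socket2<->socket3
bandwidth, so collapsing them into a single "link" directory loses
real information.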
> I don't think I have a system with an HMAT table. If you have an
> HMAT table to provide, I could show the end result.
It is new enough (ACPI 6.2) that no publicly-available hardware exists
that implements one (that I know of). Keith Busch can probably extract
one and send it to you or show you how we're faking them with QEMU.
> Note that I believe the ACPI HMAT matrix is a bad design for those
> reasons, i.e. there is a lot of commonality among the matrix
> entries, and many entries also do not make sense (e.g. an initiator
> not being able to access all the targets). I feel that link/bridge
> is much more compact and allows representing any directed graph,
> including multiple arrows from one node to the same other node.
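
Presumably that representation boils down to something like the below,
where a link is an edge with shared properties and a bridge chains two
links together (struct and field names are entirely my invention, just
restating the idea):

	#include <stdint.h>

	struct hms_link {
		uint32_t bandwidth;
		uint32_t latency;
		uint64_t initiator_mask;	/* CPUs/devices on this link */
		uint64_t target_mask;		/* memories on this link     */
	};

	struct hms_bridge {
		struct hms_link *src;	/* path continues from src... */
		struct hms_link *dst;	/* ...across the bridge to dst */
	};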
I don't disagree. But folks are building systems with them, and we need
to either deal with that or make the data manageable. You saw our
approach: we cull the data and only expose the bare minimum in sysfs.