linux-kernel - Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87o7dv897s.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 09 Jan 2024 11:41:11 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Srinivasulu Thanneeru <sthanneeru@...ron.com>,  Srinivasulu Opensrc
 <sthanneeru.opensrc@...ron.com>,  "linux-cxl@...r.kernel.org"
 <linux-cxl@...r.kernel.org>,  "linux-mm@...ck.org" <linux-mm@...ck.org>,
  "aneesh.kumar@...ux.ibm.com" <aneesh.kumar@...ux.ibm.com>,
  "dan.j.williams@...el.com" <dan.j.williams@...el.com>,  "mhocko@...e.com"
 <mhocko@...e.com>,  "tj@...nel.org" <tj@...nel.org>,
  "john@...alactic.com" <john@...alactic.com>,  Eishan Mirakhur
 <emirakhur@...ron.com>,  "Vinicius Tavares Petrucci"
 <vtavarespetr@...ron.com>,  Ravis OpenSrc <Ravis.OpenSrc@...ron.com>,
  "Jonathan.Cameron@...wei.com" <Jonathan.Cameron@...wei.com>,
  "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,  Johannes
 Weiner <hannes@...xchg.org>,  Wei Xu <weixugc@...gle.com>,  Hao Xiang
 <hao.xiang@...edance.com>,  "Ho-Ren (Jack) Chuang"
 <horenchuang@...edance.com>
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Gregory Price <gregory.price@...verge.com> writes:

> On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >
>> > From  https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> > abstract_distance_offset: override by users to deal with firmware issue.
>> >
>> > say firmware can configure the cxl node into wrong tiers, similar to
>> > that it may also configure all cxl nodes into single memtype, hence
>> > all these nodes can fall into a single wrong tier.
>> > In this case, per node adistance_offset would be good to have ?
>> 
>> I think that it's better to fix the error firmware if possible.  And
>> these are only theoretical, not practical issues.  Do you have some
>> practical issues?
>> 
>> I understand that users may want to move nodes between memory tiers for
>> different policy choices.  For that, memory_type based adistance_offset
>> should be good.
>> 
>
> There's actually an affirmative case to change memory tiering to allow
> either movement of nodes between tiers, or at least base placement on
> HMAT information. Preferably, membership would be changable to allow
> hotplug/DCD to be managed (there's no guarantee that the memory passed
> through will always be what HMAT says on initial boot).

IIUC, from Jonathan Cameron as below, the performance of memory
shouldn't change even for DCD devices.

https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/

It's possible to change the performance of a NUMA node changed, if we
hot-remove a memory device, then hot-add another different memory
device.  It's hoped that the CDAT changes too.

So, all in all, HMAT + CDAT can help us to put the memory device in
appropriate memory tiers.  Now, we have HMAT support in upstream.  We
will working on CDAT support.

--
Best Regards,
Huang, Ying

> https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>
> This group wants to enable passing CXL memory through to KVM/QEMU
> (i.e. host CXL expander memory passed through to the guest), and
> allow the guest to apply memory tiering.
>
> There are multiple issues with this, presently:
>
> 1. The QEMU CXL virtual device is not and probably never will be
>    performant enough to be a commodity class virtualization.  The
>    reason is that the virtual CXL device is built off the I/O
>    virtualization stack, which treats memory accesses as I/O accesses.
>
>    KVM also seems incompatible with the design of the CXL memory device
>    in general, but this problem may or may not be a blocker.
>
>    As a result, access to virtual CXL memory device leads to QEMU
>    crawling to a halt - and this is unlikely to change.
>
>    There is presently no good way forward to create a performant virtual
>    CXL device in QEMU.  This means the memory tiering component in the
>    kernel is functionally useless for virtual CXL memory, because...
>
> 2. When passing memory through as an explicit NUMA node, but not as
>    part of a CXL memory device, the nodes are lumped together in the
>    DRAM tier.
>
> None of this has to do with firmware.
>
> Memory-type is an awful way of denoting membership of a tier, but we
> have HMAT information that can be passed through via QEMU:
>
> -object memory-backend-ram,size=4G,id=ram-node0 \
> -object memory-backend-ram,size=4G,id=ram-node1 \
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> Not only would it be nice if we could change tier membership based on
> this data, it's realistically the only way to allow guests to accomplish
> memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> ~Gregory