linux-kernel - Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87bk9t91tx.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 10 Jan 2024 13:47:38 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Jonathan Cameron <Jonathan.Cameron@...wei.com>,  Srinivasulu Thanneeru
 <sthanneeru@...ron.com>,  Srinivasulu Opensrc
 <sthanneeru.opensrc@...ron.com>,  "linux-cxl@...r.kernel.org"
 <linux-cxl@...r.kernel.org>,  "linux-mm@...ck.org" <linux-mm@...ck.org>,
  "aneesh.kumar@...ux.ibm.com" <aneesh.kumar@...ux.ibm.com>,
  "dan.j.williams@...el.com" <dan.j.williams@...el.com>,  "mhocko@...e.com"
 <mhocko@...e.com>,  "tj@...nel.org" <tj@...nel.org>,
  "john@...alactic.com" <john@...alactic.com>,  Eishan Mirakhur
 <emirakhur@...ron.com>,  Vinicius Tavares Petrucci
 <vtavarespetr@...ron.com>,  Ravis OpenSrc <Ravis.OpenSrc@...ron.com>,
  "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,  "Johannes
 Weiner" <hannes@...xchg.org>,  Wei Xu <weixugc@...gle.com>,  Hao Xiang
 <hao.xiang@...edance.com>,  "Ho-Ren (Jack) Chuang"
 <horenchuang@...edance.com>
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Gregory Price <gregory.price@...verge.com> writes:

> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> On Tue, 09 Jan 2024 11:41:11 +0800
>> "Huang, Ying" <ying.huang@...el.com> wrote:
>> > Gregory Price <gregory.price@...verge.com> writes:
>> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
>> > It's possible to change the performance of a NUMA node changed, if we
>> > hot-remove a memory device, then hot-add another different memory
>> > device.  It's hoped that the CDAT changes too.
>> 
>> Not supported, but ACPI has _HMA methods to in theory allow changing
>> HMAT values based on firmware notifications...  So we 'could' make
>> it work for HMAT based description.
>> 
>> Ultimately my current thinking is we'll end up emulating CXL type3
>> devices (hiding topology complexity) and you can update CDAT but
>> IIRC that is only meant to be for degraded situations - so if you
>> want multiple performance regions, CDAT should describe them form the start.
>> 
>
> That was my thought.  I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.
>
>> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> > >
>> > > This group wants to enable passing CXL memory through to KVM/QEMU
>> > > (i.e. host CXL expander memory passed through to the guest), and
>> > > allow the guest to apply memory tiering.
>> > >
>> > > There are multiple issues with this, presently:
>> > >
>> > > 1. The QEMU CXL virtual device is not and probably never will be
>> > >    performant enough to be a commodity class virtualization.
>> 
>> I'd flex that a bit - we will end up with a solution for virtualization but
>> it isn't the emulation that is there today because it's not possible to
>> emulate some of the topology in a peformant manner (interleaving with sub
>> page granularity / interleaving at all (to a lesser degree)). There are
>> ways to do better than we are today, but they start to look like
>> software dissagregated memory setups (think lots of page faults in the host).
>>
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest.  That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.
>
> qemu/kvm would report errors if you tried to touch the naughty bits.
>
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D?  If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
>> > >
>> > > 2. When passing memory through as an explicit NUMA node, but not as
>> > >    part of a CXL memory device, the nodes are lumped together in the
>> > >    DRAM tier.
>> > >
>> > > None of this has to do with firmware.
>> > >
>> > > Memory-type is an awful way of denoting membership of a tier, but we
>> > > have HMAT information that can be passed through via QEMU:
>> > >
>> > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> > >
>> > > Not only would it be nice if we could change tier membership based on
>> > > this data, it's realistically the only way to allow guests to accomplish
>> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> 
>> This I fully agree with.  There will be systems with a bunch of normal DDR with different
>> access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> before we get anything more complex in place for CXL.
>> 
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard.  There probably could be
> a release valve for us to still make those devices useful.
>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.
>
> </2cents>

It's the last resort to provide hardware information from user space.
We should try to avoid that if possible.

Per my understanding, per-memory-type abstract distance overriding is to
apply specific policy.  While, per-memory-node abstract distance
overriding is to provide missing hardware information.

--
Best Regards,
Huang, Ying