linux-kernel - Re: [RFC] cxl/region: set numa node for target memdevs when a region is committed


Message-ID: <20250321122256.00005b71@huawei.com>
Date: Fri, 21 Mar 2025 12:22:56 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Dan Williams <dan.j.williams@...el.com>
CC: Fan Ni <nifan.cxl@...il.com>, Dave Jiang <dave.jiang@...el.com>,
	<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
	<ira.weiny@...el.com>, <linux-cxl@...r.kernel.org>,
	<a.manzanares@...sung.com>, <dave@...olabs.net>,
	<linux-kernel@...r.kernel.org>, <anisa.su887@...il.com>
Subject: Re: [RFC] cxl/region: set numa node for target memdevs when a
 region is committed

On Tue, 18 Mar 2025 17:16:11 -0700
Dan Williams <dan.j.williams@...el.com> wrote:

> Fan Ni wrote:
> > On Tue, Mar 18, 2025 at 02:25:40PM -0700, Dan Williams wrote:  
> > > Dave Jiang wrote:  
> > > > 
> > > > 
> > > > On 3/14/25 9:40 AM, nifan.cxl@...il.com wrote:  
> > > > > From: Fan Ni <fan.ni@...sung.com>
> > > > > 
> > > > > > There is a sysfs attribute named "numa_node" for each cxl memory device.
> > > > > > However, it is never set, so -1 is returned whenever it is read.
> > > > > 
> > > > > > With this change, the numa_node of each target memdev is set based on the
> > > > > > start address of the hpa_range of the endpoint decoder it is associated
> > > > > > with when a cxl region is created; it is reset when the region decoders
> > > > > > are reset.
> > > > > 
> > > > > > Open question: do we need to set the numa_node when the memdev is
> > > > > probed instead of waiting until a region is created?  
> > > > 
> > > > Typically, the numa node for a PCI device should be dev_to_node(),
> > > > where the device resides. So when the device is probed, it should be
> > > > set with that. See documentation [1]. Region should have its own NUMA
> > > > node based on phys_to_target_node() of the starting address.    
> > > 
> > > Right, the memdev node is the affinity of device-MMIO to a CPU. The
> > > HDM-memory that the device decodes may land in multiple proximity
> > > domains and is subject to CDAT, CXL QoS, HMAT Generic Port, etc...
> > > 
> > > If your memdev node is "NUMA_NO_NODE" then that likely means the
> > > affinity information for the PCI device is missing.
> > > 
> > > I would double check that first. See set_dev_node() in device_add().  
> > 
> > Thanks Dave and Dan for the explanation. 
> > Then the issue must be from qemu setup.
> > 
> > I added some debug code as below
> > ---------------------------------------------
> > fan:~/cxl/linux-fixes$ git diff
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 5a1f05198114..c86a9eb58e99 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -3594,6 +3594,10 @@ int device_add(struct device *dev)
> >         if (kobj)
> >                 dev->kobj.parent = kobj;
> >  
> > +        dev_dbg(dev, "device: '%s': %s XX node %d\n", dev_name(dev), __func__, dev_to_node(dev));
> > +        if (parent) {
> > +                dev_dbg(parent, "parent device: '%s': %s XX node %d\n", dev_name(parent), __func__, dev_to_node(parent));
> > +        }
> >         /* use parent numa_node */
> >         if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> >                 set_dev_node(dev, dev_to_node(parent));
> > ---------------------------------------------
> > 
> > > The output after loading the cxl related drivers looks like below. Every
> > > numa_node in the cxl topology is -1.
> > 
> > Hi Jonathan,
> >    do I miss something in the qemu setup ??  
> 
> IIUC the typical expectation for communicating the affinity of PCI
> devices is an ACPI _PXM property for the host bridge object in the
> [DS]SDT. As far as I can see QEMU does not build _PXM information for
> its host bridges.
> 
First, a side note: _PXM on the device itself is in theory also an
option, but the 'fix' for that was reverted long ago because some really
broken old AMD platforms put devices in nonexistent nodes. Hmm. I should
revisit that, as I 'think' the allocation problems with broken numa nodes
are long fixed (that included an ACPI spec clarification, so it took a while!)
https://lore.kernel.org/linux-pci/20181211094737.71554-1-Jonathan.Cameron@huawei.com/

As for _PXM on host bridges, the gpex ACPI code does assign it for
PCI Expander Bridges if you pass in the node:
https://elixir.bootlin.com/qemu/v9.2.2/source/hw/pci-host/gpex-acpi.c#L178
(So we are good on ARM :)
The x86 ACPI build code does the same:
https://elixir.bootlin.com/qemu/v9.2.2/source/hw/i386/acpi-build.c#L1533

Those go via some indirections to a callback here:
https://elixir.bootlin.com/qemu/v9.2.2/source/hw/pci-bridge/pci_expander_bridge.c#L80

So set numa_node=X for each of your PXB instances.
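
For illustration only, here is an untested sketch of what that could look
like on x86. All ids, sizes, the bus_nr value and the two-node guest layout
are assumptions made up for this example; only the numa_node= property on
the PXB is the point. The node you pass has to be one that is actually
defined with -numa, otherwise QEMU should reject it.

```shell
#!/bin/sh
# Untested sketch. Every id, size, bus_nr and the two-node layout below
# are assumptions for illustration, not taken from the setup in this thread.
PXB_NODE=0   # node the PXB should report via _PXM; must be defined by -numa

QEMU_ARGS="-M q35,cxl=on -m 4G -smp 4 \
  -object memory-backend-ram,id=ram0,size=2G \
  -numa node,nodeid=0,memdev=ram0,cpus=0-1 \
  -object memory-backend-ram,id=ram1,size=2G \
  -numa node,nodeid=1,memdev=ram1,cpus=2-3 \
  -object memory-backend-ram,id=cxl-mem0,size=256M \
  -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=12,numa_node=${PXB_NODE} \
  -device cxl-rp,id=rp0,bus=cxl.1,port=0,chassis=0,slot=2 \
  -device cxl-type3,id=cxl-vmem0,bus=rp0,volatile-memdev=cxl-mem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G"

# Print the command for inspection rather than launching the guest.
printf 'qemu-system-x86_64 %s\n' "$QEMU_ARGS"
```

With something like that in place, the memdev's numa_node attribute in
sysfs (for example /sys/bus/cxl/devices/mem0/numa_node, assuming the device
shows up as mem0) would be expected to inherit the PXB's node instead of
reading -1.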

Jonathan