linux-kernel - RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <BY5PR21MB15067D1C5AA731A340A7AF34CE4D9@BY5PR21MB1506.namprd21.prod.outlook.com>
Date:   Fri, 7 Jan 2022 20:31:40 +0000
From:   Long Li <longli@...rosoft.com>
To:     "Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
        "longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        Purna Pavan Chandra Aekkaladevi <paekkaladevi@...rosoft.com>
Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with
 parameters affecting NUMA topology

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
> 
> From: longli@...uxonhyperv.com <longli@...uxonhyperv.com> Sent:
> Thursday, January 6, 2022 3:20 PM
> >
> > When the kernel boots with parameters restricting the number of cpus
> > or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only
> > set to the NUMA node to a value that is valid in the current running kernel.
> >
> > Signed-off-by: Long Li <longli@...rosoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci- hyperv.c index
> > fc1a29acadbb..8686343eff4c 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > hv_pcibus_device *hbus)
> >  		if (!hv_dev)
> >  			continue;
> >
> > -		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > -			set_dev_node(&dev->dev, hv_dev-
> >desc.virtual_numa_node);
> > +		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > +			int cpu;
> > +			bool found_node = false;
> > +
> > +			for_each_possible_cpu(cpu)
> > +				if (cpu_to_node(cpu) ==
> > +				    hv_dev->desc.virtual_numa_node) {
> > +					found_node = true;
> > +					break;
> > +				}
> > +
> > +			if (found_node)
> > +				set_dev_node(&dev->dev,
> > +					     hv_dev-
> >desc.virtual_numa_node);
> > +		}
> 
> I'm wondering about this approach vs. just comparing against nr_node_ids.

I was trying to fix this by comparing with nr_node_ids. This worked for numa=off, but it didn't work with maxcpus=X.

maxcpus=X is commonly used in kdump kernels. In this config,  the memory system is initialized in a way that only the NUMA nodes within maxcpus are setup and can be used by the drivers.

> Comparing against nr_node_ids would handle the case of numa=off on the
> kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
> numa=fake.  Your approach is also affected by which CPUs are online, since
> cpu_to_node() references percpu data.  It would seem to produce more
> variable results since CPUs can go online and offline while the VM is running.
> If a network VF device was removed and re-added, the results of your
> algorithm could be different for the re-add, depending on which CPUs were
> online at the time.
> 
> My impression (which may be incorrect) is that the device numa_node is
> primarily to allow the driver to allocate memory from the closest NUMA node,
> and such memory allocations don't really need to be affected by which CPUs
> are online.

Yes, this is the reason I'm using for_each_possible_cpu(). Even if some CPUs are not online, the memory system is setup in a way that allow driver to allocate memory on that NUMA node. The algorithm guarantees the value of NUMA node is valid when calling set_dev_node().

> 
> Thoughts?
> 
> >
> >  		put_pcichild(hv_dev);
> >  	}
> > --
> > 2.25.1