linux-kernel - RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

Message-ID: <MWHPR21MB1593386F34FD34260FAE89EFD7589@MWHPR21MB1593.namprd21.prod.outlook.com>
Date:   Tue, 18 Jan 2022 22:59:06 +0000
From:   "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
To:     Long Li <longli@...rosoft.com>,
        "longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        Purna Pavan Chandra Aekkaladevi <paekkaladevi@...rosoft.com>
Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with
 parameters affecting NUMA topology

From: Long Li <longli@...rosoft.com> Sent: Tuesday, January 18, 2022 2:44 PM
> >
> > From: Long Li <longli@...rosoft.com> Sent: Wednesday, January 12, 2022 4:59 PM
> > >
> > > > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel
> > > > boots with parameters affecting NUMA topology
> > > >
> > > > > From: Long Li <longli@...rosoft.com> Sent: Friday, January 7, 2022 12:32 PM
> > > > > >
> > > > > > > From: longli@...uxonhyperv.com <longli@...uxonhyperv.com> Sent: Thursday, January 6, 2022 3:20 PM
> > > > > > >
> > > > > > > > When the kernel boots with parameters restricting the number
> > > > > > > > of cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI
> > > > > > > > driver should only set the NUMA node to a value that is valid
> > > > > > > > in the currently running kernel.
> > > > > > >
> > > > > > > Signed-off-by: Long Li <longli@...rosoft.com>
> > > > > > > ---
> > > > > > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > > > > > > index fc1a29acadbb..8686343eff4c 100644
> > > > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > > > @@ -1835,8 +1835,21 @@ static void
> > > > > > > hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > > > > > >  		if (!hv_dev)
> > > > > > >  			continue;
> > > > > > >
> > > > > > > -		if (hv_dev->desc.flags &
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > > > -			set_dev_node(&dev->dev, hv_dev-
> > >desc.virtual_numa_node);
> > > > > > > +		if (hv_dev->desc.flags &
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > > > +			int cpu;
> > > > > > > +			bool found_node = false;
> > > > > > > +
> > > > > > > +			for_each_possible_cpu(cpu)
> > > > > > > +				if (cpu_to_node(cpu) ==
> > > > > > > +				    hv_dev->desc.virtual_numa_node) {
> > > > > > > +					found_node = true;
> > > > > > > +					break;
> > > > > > > +				}
> > > > > > > +
> > > > > > > +			if (found_node)
> > > > > > > +				set_dev_node(&dev->dev,
> > > > > > > +					     hv_dev-
> > >desc.virtual_numa_node);
> > > > > > > +		}
> > > > > >
> > > > > > > I'm wondering about this approach vs. just comparing against
> > > > > > > nr_node_ids.
> > > > >
> > > > > I was trying to fix this by comparing with nr_node_ids. This
> > > > > worked for numa=off, but it didn't work with maxcpus=X.
> > > > >
> > > > > > maxcpus=X is commonly used in kdump kernels. In this config, the
> > > > > > memory system is initialized in a way that only the NUMA nodes
> > > > > > within maxcpus are set up and can be used by the drivers.
> > > >
> > > > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
> > > > nodes, the number of NUMA nodes configured in the kernel is not
> > > > affected by maxcpus= on the kernel boot line.  This VM has 48 vCPUs
> > > > > and 2 NUMA nodes, and is Generation 2.  Even with maxcpus=4 or
> > > > > maxcpus=1, these lines are output during boot:
> > > >
> > > > [    0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > > > [    0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> > > >
> > > > and
> > > >
> > > > [    0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> > 0x0000007edfffffff]
> > > > [    0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> > 0x000000fcdfffffff]
> > > >
> > > > It's perfectly legit to have a NUMA node with memory but no CPUs.
> > > > The memory assigned to the NUMA node is determined by the ACPI SRAT.
> > > > So I'm wondering what is causing the kdump issue you see.  Or maybe
> > > > the behavior of older kernels is different.
> > >
> > > > Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But
> > > > I'm not sure if that matters, as the descriptions of these two in the
> > > > kernel doc are the same.
> > >
> > > On my system (4 NUMA nodes) with kdump boot line:  (maybe if you try a
> > > VM with 4 NUMA nodes, you can see the problem)
> > > [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> > > root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1
> > > console=ttyS0
> > > earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service
> > > nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0
> > > elfcorehdr=4038049140K
> > >
> > > I see the following:
> > > [    0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> > > [    0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> > > [    0.413031] Zone ranges:
> > > [    0.414117]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > > [    0.416522]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> > > [    0.418932]   Normal   [mem 0x0000000100000000-0x000003c2bef5cfff]
> > > [    0.421357]   Device   empty
> > > [    0.422454] Movable zone start for each node
> > > [    0.424109] Early memory node ranges
> > > [    0.425541]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
> > > [    0.428050]   node   0: [mem 0x000000001d000000-0x000000002cffffff]
> > > [    0.430547]   node   3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > > [    0.432963] Initmem setup node 0 [mem 0x0000000000001000-
> > 0x000000002cffffff]
> > > [    0.435695] Initmem setup node 3 [mem 0x000003c27f000000-
> > 0x000003c2bef5cfff]
> > > [    0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> > > [    0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> > > [    0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> > > [    0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> > >
> > > > It's unclear to me why nodes 1 and 2 are missing. But I don't think
> > > > it's a Hyper-V problem since it's only affected by setting nr_cpus
> > > > on the kernel boot line. Later, a device driver
> > > > (mlx5 in this example) tries to allocate memory on node 1 and fails:
> > >
> >
> > To summarize some offline conversation, we've figured out that the "missing"
> > NUMA nodes are not due to setting maxcpus=1 or nr_cpus=1.  Setting the cpu
> > count doesn't affect any of this.
> >
> > > Instead, Linux is modifying the memory map prior to starting the kdump kernel
> > > so that most of the memory is not touched and is preserved to be dumped,
> > > which is the whole point of kdump. This modified memory map has no memory
> > > in NUMA nodes 1 and 2, so it is correct to just see nodes 0 and 3 as online.
> >
> > > I think the code fix here is pretty simple:
> >
> > 	int node;
> >
> > 	node = hv_dev->desc.virtual_numa_node;
> > 	if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > 			&& (node < nr_node_ids))
> > 		set_dev_node(&dev->dev, numa_map_to_online_node(node));
> >
> > Michael
> 
> Okay, this looks good.
> 
> I'm sending a V2 (with a minor change) after testing is done.
> 
> Long

Please leave a comment in the code as to why a NUMA node might be
offline.   In the future, somebody new might not know what can happen.
I certainly didn't. :-(

Michael
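
For readers following the thread, the node selection proposed above can be modeled outside the kernel. The following is only a userspace sketch under stated assumptions, not the driver code: every sketch_* name is hypothetical, the online[] array stands in for the kernel's online node mask, and "nearest" is approximated by the absolute difference of node ids, whereas the kernel's numa_map_to_online_node() picks the online node with the smallest node_distance().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 8     /* hypothetical upper bound for this model */
#define SKETCH_NO_NODE   (-1)  /* stand-in for the kernel's NUMA_NO_NODE */

/* Hypothetical model of the NUMA state seen by the running kernel. */
struct sketch_topology {
	int  nr_node_ids;              /* one past the highest possible node id */
	bool online[SKETCH_MAX_NODES]; /* nodes that have memory in this boot */
};

/*
 * Return the node itself if it is online, otherwise the closest online
 * node. Assumption: closeness is |a - b| on node ids, which only
 * approximates the distance table the kernel consults.
 */
static int sketch_map_to_online_node(const struct sketch_topology *t, int node)
{
	int best = SKETCH_NO_NODE;
	int best_dist = 0;

	assert(t->nr_node_ids <= SKETCH_MAX_NODES);

	if (node < 0 || node >= t->nr_node_ids)
		return SKETCH_NO_NODE;
	if (t->online[node])
		return node;

	for (int n = 0; n < t->nr_node_ids; n++) {
		int dist;

		if (!t->online[n])
			continue;
		dist = abs(n - node);
		if (best == SKETCH_NO_NODE || dist < best_dist) {
			best = n;
			best_dist = dist;
		}
	}
	return best;
}

/*
 * Mirror of the proposed check: use the host-reported node only when the
 * affinity flag is set and the id is below nr_node_ids; otherwise leave
 * the device's current node untouched.
 */
static int sketch_assign_node(const struct sketch_topology *t,
			      bool has_affinity, int reported_node,
			      int current_node)
{
	if (has_affinity && reported_node >= 0 &&
	    reported_node < t->nr_node_ids)
		return sketch_map_to_online_node(t, reported_node);
	return current_node;
}
```

With a topology like the kdump boot quoted above (four possible nodes, only nodes 0 and 3 online), this model sends a device reported on node 1 to node 0 and a device reported on node 2 to node 3, while a reported id at or above nr_node_ids leaves the device's node unchanged.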