Date:	Mon, 24 Mar 2014 14:03:07 +0800
From:	Daniel J Blueman <daniel@...greenorth.com>
To:	Bjorn Helgaas <bhelgaas@...gle.com>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	"x86@...nel.org" <x86@...nel.org>, Borislav Petkov <bp@...e.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Steffen Persvold <sp@...ascale.com>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
	kim.naru@....com,
	Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>,
	Myron Stowe <myron.stowe@...hat.com>,
	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>
Subject: Re: [PATCH] Fix northbridge quirk to assign correct NUMA node

On 03/22/2014 12:11 AM, Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi for _PXM questions]
>
> On Thu, Mar 20, 2014 at 9:38 PM, Daniel J Blueman <daniel@...ascale.com> wrote:
>> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@...ascale.com>
>>> wrote:
>>>>
>>>> For systems with multiple servers and routed fabric, all northbridges get
>>>> assigned to the first server. Fix this by also using the node reported from
>>>> the PCI bus. For single-fabric systems, the northbridges are on PCI bus 0
>>>> by definition, which is on NUMA node 0 by definition, so this is invariant
>>>> on most systems.
>>>>
>>>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>>>> for stable.
>
>>> So I suspect the problem is more complicated, and maybe _PXM is
>>> insufficient to describe the topology?  Are there subtrees that should
>>> have nodes different from the host bridge?
>>
>> Yes; see below.
>> ...
>> The _PXM method associates each northbridge with the first NUMA node, 0 in
>> single-fabric systems, and eg 4 for the second server in a multi-fabric
>> system with 2 dual-module Opterons (with 2 NUMA nodes internally) etc, since
>> the northbridges appear in the PCI tree, under the host bridge, not above it
>> [1].
>>
>> With _PXM, the rest of the PCI bus hierarchy has the right NUMA node
>> associated, but the northbridge PCI devices should be associated with their
>> actual NUMA node, 0, 1, 2, 3 for the first server in this example. The quirk
>> fixes this up; irqbalance at least uses this NUMA data exposed in /sys.
>
> I'm confused about which devices we're talking about.  We currently
> look at _PXM for PNP0A08 (and PNP0A03) ACPI devices.  The resulting
> node is associated with every PCI device we enumerate below the
> PNP0A08 bridge.  This association is made in pci_device_add().
>
> When you say "northbridge PCI devices should be associated with their
> actual NUMA node," I assume you mean the 00:18.x and 00:19.x devices
> ("AMD Family 10h Processor ..."), since those seem to be what the
> quirk applies to.  You are *not* talking about 00:00.0 ("ATI RD890
> Northbridge"), right?

Yes, on bus 0, devices 0x18 to 0x1f decode to the (up to) eight
HyperTransport devices in the processor fabric, normally all processor
northbridges.
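
As a rough illustration of that decoding, here is a minimal userspace
sketch (not kernel code). nb_dev_to_ht_node(), NB_DEV_BASE and NB_MAX
are hypothetical names, and it assumes the NodeId has not been
reprogrammed, so that HyperTransport node N answers at device 0x18 + N:

```c
#define NB_DEV_BASE 0x18u /* first northbridge device number on bus 0 */
#define NB_MAX      8u    /* up to eight HyperTransport nodes */

/*
 * Map a PCI bus/device pair to a HyperTransport node index.
 * Returns -1 when the device is not in the northbridge range.
 * Assumption: default NodeId programming (node N at device 0x18 + N).
 */
static int nb_dev_to_ht_node(unsigned int bus, unsigned int dev)
{
	if (bus != 0 || dev < NB_DEV_BASE || dev >= NB_DEV_BASE + NB_MAX)
		return -1;
	return (int)(dev - NB_DEV_BASE);
}
```

So 00:18.x would be HyperTransport node 0 and 00:19.x node 1, while
00:00.0 (the RD890) falls outside the range.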

> You mention irqbalance; is the NUMA node information for the 00:18.x
> and 00:19.x devices important because you get a lot of interrupts from
> those devices?  Or is the issue with actual I/O devices (NICs, SCSI
> adapters, etc.)?  If so, I don't see how this quirk would affect
> those, because the node information for them comes from the PNP0A08
> bridge (in pci_device_add()), not from the 00:00.0, 00:18.x, or
> 00:19.x devices.

I still need to investigate the lockups irqbalance was causing on a
customer system, and am not sure which interrupt source was being
rewritten and causing the hangs; disabling the daemon prevented them.

>> The alternative to the quirk may be to explicitly express the northbridge
>> PCI devices in the AML with their own _PXM methods. If it's valid, it may be
>> the honest approach, though the quirk may be needed for most BIOSs; I can
>> check the AML on a few servers to confirm if helpful.
>
> ACPI allows _PXM for any device, so this might be a possible approach.
>   However, it looks like Linux only pays attention to _PXM for
> PNP0A08/03, CPUs, memory and IOAPICs (which seems like a Linux defect
> to me).

> I'm really worried about the approach here:
>
>          pci_read_config_dword(nb_ht, 0x60, &val);
>          node = pcibus_to_node(dev->bus) | (val & 7);
>
> because the pcibus_to_node() information comes indirectly from _PXM,
> and the "val" part comes from the hardware, and I don't think these
> are the same node number space.  If I understand correctly, the BIOS
> can synthesize whatever numbers it wants for _PXM, which returns a
> "proximity domain," and then Linux can make up its own mapping of
> "proximity domain" to "logical Linux node."  So I don't see why we can
> assume that it's valid to OR in the bits from a PCI config register to
> this logical Linux node number.

pcibus_to_node() uses the proximity domain values from the ACPI SRAT
table, which are thus correctly mapped to Linux NUMA node IDs, so my
one-liner is still progress.
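
To make the arithmetic concrete, here is a minimal userspace sketch of
what the one-liner computes (not kernel code). quirk_node() and
or_matches_add() are hypothetical names; bus_node stands in for
pcibus_to_node(dev->bus) and ht_reg for the dword read from config
offset 0x60. It assumes the intent is "bus node as base, NodeId as
offset", which the OR only delivers when the two share no set bits:

```c
#include <stdint.h>

/* Model of: node = pcibus_to_node(dev->bus) | (val & 7); */
static int quirk_node(int bus_node, uint32_t ht_reg)
{
	return bus_node | (int)(ht_reg & 7);
}

/*
 * Returns 1 when OR-ing gives the same result as adding the NodeId
 * to the bus node, i.e. when no bits overlap; 0 otherwise.
 */
static int or_matches_add(int bus_node, uint32_t ht_reg)
{
	int nodeid = (int)(ht_reg & 7);

	return (bus_node | nodeid) == (bus_node + nodeid);
}
```

With a bus node of 4 and a NodeId of 2 this gives 6, as intended. With
a hypothetical bus node of 3 and a NodeId of 1, the OR gives 3 rather
than 4, which is the kind of mismatch Bjorn is worried about if the
Linux node numbering is not aligned with the hardware NodeId.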

Linux allocates NUMA node IDs in the order PXM values are seen in the
SRAT table, i.e. via first_unset_node(nodes_found_map). The APIC IDs are
initialised from the HyperTransport NodeId [1, p263 and p465], but the
NodeId can be reprogrammed after the APIC IDs are set (which also
changes which PCI device number, counting from 0x18 on bus 0, the
northbridge responds to), and the SRAT table needn't be emitted in
order, perhaps except for the bootstrap core.
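
A simplified userspace model of that first-seen allocation, to show why
SRAT ordering matters. struct pxm_map, pxm_map_init() and map_pxm() are
hypothetical names, and a plain counter stands in for
first_unset_node(nodes_found_map), which is equivalent only as long as
nodes are never cleared:

```c
#define MAX_PXM   256
#define MAX_NODES 64
#define NO_NODE   (-1)

struct pxm_map {
	int pxm_to_node[MAX_PXM]; /* NO_NODE until the PXM is first seen */
	int nodes_found;          /* number of node IDs handed out so far */
};

static void pxm_map_init(struct pxm_map *m)
{
	int i;

	for (i = 0; i < MAX_PXM; i++)
		m->pxm_to_node[i] = NO_NODE;
	m->nodes_found = 0;
}

/* Return the node for a PXM, allocating the next free node on first use. */
static int map_pxm(struct pxm_map *m, int pxm)
{
	if (pxm < 0 || pxm >= MAX_PXM)
		return NO_NODE;
	if (m->pxm_to_node[pxm] == NO_NODE) {
		if (m->nodes_found >= MAX_NODES)
			return NO_NODE;
		m->pxm_to_node[pxm] = m->nodes_found++;
	}
	return m->pxm_to_node[pxm];
}
```

If a hypothetical SRAT lists PXM 4 before PXM 0, then PXM 4 becomes
Linux node 0 and PXM 0 becomes node 1, so the Linux node ID need not
match either the PXM value or the hardware NodeId.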

I guess fixing the original quirk depends on how important these cases 
really are.

Thanks,
   Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
-- 
Daniel J Blueman
Principal Software Engineer, Numascale