linux-kernel - Re: [PATCH v2 RESEND] Add NumaChip remote PCI support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <50B8EFBF.2050403@numascale.com>
Date:	Fri, 30 Nov 2012 18:41:19 +0100
From:	Steffen Persvold <sp@...ascale.com>
To:	Bjorn Helgaas <bhelgaas@...gle.com>
CC:	Daniel J Blueman <daniel@...ascale-asia.com>,
	linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 RESEND] Add NumaChip remote PCI support

Hi Bjorn,

On 11/30/2012 17:45, Bjorn Helgaas wrote:
> On Thu, Nov 29, 2012 at 10:28 PM, Daniel J Blueman
[]
>> We could expose pci_dev_base via struct x86_init_pci; the extra complexity
>> and performance tradeoff may not be worth it for a single case perhaps?
>
> Oh, right, I forgot that you can't decide this at build-time.  This is
> PCI config access, which is not a performance path, so I'm not really
> concerned about it from that angle, but you make a good point about
> the complexity.
>
> The reason I'm interested in this is because MMCONFIG is a generic
> PCIe feature but is currently done via several arch-specific
> implementations, so I'm starting to think about how we can make parts
> of it more generic.  From that perspective, it's nicer to parameterize
> an existing implementation than to clone it because it makes
> refactoring opportunities more obvious.
>
> Backing up a bit, I'm curious about exactly why you need to check for
> the limit to begin with.  The comment says "Ensure AMD Northbridges
> don't decode reads to other devices," but that doesn't seem strictly
> accurate.  You're not changing anything in the hardware to prevent it
> from *decoding* a read, so it seems like you're actually just
> preventing the read in the first place.
>
> What happens without the limit check?  Do you get a response timeout
> and a machine check?  Read from the wrong device?'

The latter. I'm not sure how familiar you are with how pci config reads 
are decoded and handled on coherent hypertransport fabrics; The way it 
works *within* one coherent HT fabric is that the CPU will redirect all 
config space access above a configured max HT node (a setting in the AMD 
northbridge) to a specific I/O link (non-coherent link) which usually 
links up with a "southbridge" device that responds with a target abort 
(non-existing device).

However, this only works when a CPU core is accessing local HT devices. 
In our architecture, we "glue" together multiple HT fabrics and when a 
CPU core sends a pci config space request (mmconfig) to a remote machine 
(via our hardware) this re-direction is not applied anymore. The result 
is that when a mmconfig read comes in to a coherent HT device on bus00 
which is non-existent, one of the other HT nodes on that remote node 
will respond to the read, leading to "phantom" devices (i.e lspci will 
show more HT northbridges than what's really physically present) *or* 
worst case scenario will be that the transaction hangs (alternatively 
times out, leading to MCE and other bad things).

This is why we're checking accesses to bus0, device24-31 and returning a 
"fake" target abort scenario if the access was to a non-existing HT 
device. In other words, we're doing in software what a "normal" HT based 
platform would do in hardware.

>
> As far as I can tell, you still describe your MMCONFIG area with an
> MCFG table (since you use pci_mmconfig_lookup() to find the region).
> That table only includes the starting and ending bus numbers, so the
> assumption is that the MMCONFIG space is valid for every possible
> device on those buses.  So it seems like your system is not really
> compatible with the spec here.
>
> Because the MCFG table can't describe finer granularity than start/end
> bus numbers, we manage MMCONFIG regions as (segment, start_bus,
> end_bus, address) tuples.  Maybe if we tracked it with slightly finer
> granularity, e.g., (segment, start_bus, end_bus, end_bus_device,
> address), you could have some sort of MCFG-parsing quirk that reduces
> the size of the MMCONFIG region you register for bus 0.
>
> Just brainstorming here; it's not obvious to me yet what the best solution is.
>
> Bjorn
>

Kind regards,
-- 
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/