Message-ID: <4FBA8614.7020100@redhat.com>
Date: Mon, 21 May 2012 14:14:44 -0400
From: Don Dutile <ddutile@...hat.com>
To: Alex Williamson <alex.williamson@...hat.com>
CC: Bjorn Helgaas <bhelgaas@...gle.com>, kvm@...r.kernel.org,
B07421@...escale.com, aik@...abs.ru, benh@...nel.crashing.org,
linux-pci@...r.kernel.org, agraf@...e.de, qemu-devel@...gnu.org,
chrisw@...s-sol.org, B08248@...escale.com,
iommu@...ts.linux-foundation.org, gregkh@...uxfoundation.org,
avi@...hat.com, benve@...co.com, dwmw2@...radead.org,
linux-kernel@...r.kernel.org, david@...son.dropbear.id.au
Subject: Re: [PATCH 05/13] pci: New pci_acs_enabled()
On 05/21/2012 10:59 AM, Alex Williamson wrote:
> On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:
>> On 05/18/2012 10:47 PM, Alex Williamson wrote:
>>> On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
>>>> On 05/18/2012 06:02 PM, Alex Williamson wrote:
>>>>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>>>>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>>>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>>>>>> <alex.williamson@...hat.com> wrote:
>>>>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>>>>>> <alex.williamson@...hat.com> wrote:
>>>>>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>>>>>> reach the root bus before being re-routed. Peer-to-peer DMA
>>>>>>>>>>> may actually not be seen by the IOMMU in these cases. For
>>>>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>>>>>> these restrictions. Provided with a PCI device, pci_acs_enabled
>>>>>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>>>>>> chain. This information can then be used in grouping to create
>>>>>>>>>>> fully isolated groups. ACS chain logic extracted from libvirt.
>>>>>>>>>>
>>>>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>>>>>
>>>>>>>>> Right, maybe this should be:
>>>>>>>>>
>>>>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>>>>>
>>>>>> +1; there is a global in the PCI code, pci_acs_enable,
>>>>>> and a function pci_enable_acs(), which the above name is easily
>>>>>> confused with. I recommend pci_find_top_acs_bridge() as the
>>>>>> most descriptive name.
>>>> Finally, with my email filters fixed, I can see this email... :)
>>>
>>> Welcome back ;)
>>>
>> Indeed... and I recvd 3 copies of this reply,
>> so the pendulum has swung in the other direction... ;-)
>>
>>>>> Yep, the new API I'm working with is:
>>>>>
>>>>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
>>>>> bool pci_acs_path_enabled(struct pci_dev *start,
>>>>> struct pci_dev *end, u16 acs_flags);
>>>>>
>>>> ok.
>>>>
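
For illustration, a minimal sketch of how such a path walk might look --
this is just my reading of the proposed semantics, not the actual patch
(it leans on the single-device pci_acs_enabled() above):

static bool pci_acs_path_enabled(struct pci_dev *start,
                                 struct pci_dev *end, u16 acs_flags)
{
        struct pci_dev *pdev = start;

        /* Walk upstream from @start, requiring ACS at every step;
         * a NULL @end means "all the way to the root bus".
         */
        while (pdev != end) {
                if (!pci_acs_enabled(pdev, acs_flags))
                        return false;
                if (pci_is_root_bus(pdev->bus))
                        return end == NULL;
                pdev = pdev->bus->self; /* upstream bridge */
        }
        return true;
}
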
>>>>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>>>>>
>>>>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>>>>>> group.
>>>>>>>>>
>>>>>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>>>>>> redirect a dma access without iommu translation,
>>>>>>>>
>>>>>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>>>>>> ranges that are forwarded from primary to secondary interface, and the
>>>>>>>> inverse ranges are forwarded from secondary to primary? For example,
>>>>>>>> here:
>>>>>>>>
>>>>>>>>                    ^
>>>>>>>>                    |
>>>>>>>>           +--------+-------+
>>>>>>>>           |                |
>>>>>>>>    +------+-----+   +------+-----+
>>>>>>>>    | Downstream |   | Downstream |
>>>>>>>>    |    Port    |   |    Port    |
>>>>>>>>    |   06:05.0  |   |   06:06.0  |
>>>>>>>>    +------+-----+   +------+-----+
>>>>>>>>           |                |
>>>>>>>>      +----v----+      +----v----+
>>>>>>>>      | Endpoint|      | Endpoint|
>>>>>>>>      | 07:00.0 |      | 08:00.0 |
>>>>>>>>      +---------+      +---------+
>>>>>>>>
>>>>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>>>>>> 06:06.0 and forwarded downstream to 08:00.0. This is plain old PCI,
>>>>>>>> nothing specific to PCIe.
>>>>>>>
>>>>>>> Right, I think the main PCI difference is the point-to-point nature of
>>>>>>> PCIe vs legacy PCI bus. On a legacy PCI bus there's no way to prevent
>>>>>>> devices talking to each other, but on PCIe the transaction makes a
>>>>>>> U-turn at some point and heads out another downstream port. ACS allows
>>>>>>> us to prevent that from happening.
>>>>>>>
>>>>>> detail: PCIe up/downstream routing is really done by an internal switch;
>>>>>> ACS overrides the legacy PCI base-limit address routing and *forces*
>>>>>> the switch to always route the transaction from a downstream port
>>>>>> to the upstream port.
>>>>>>
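
For reference, forcing that routing is just a matter of setting the control
bits in the port's ACS extended capability -- a rough sketch along the lines
of the kernel's pci_enable_acs() (the function name here is illustrative):

static void acs_force_upstream(struct pci_dev *pdev)
{
        int pos;
        u16 ctrl;

        pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        if (!pos)
                return;         /* no ACS capability: nothing to force */

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        /* Source Validation, P2P Request/Completion Redirect,
         * Upstream Forwarding */
        ctrl |= PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
        pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}
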
>>>>>>>> I don't understand ACS very well, but it looks like it basically
>>>>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>>>>>> instead of being directly claimed by 06:06.0.
>>>>>>>
>>>>>>> Yep, that's my meager understanding as well.
>>>>>>>
>>>>>> +1
>>>>>>
>>>>>>>>> so we're looking for
>>>>>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>>>>>> the root bus.
>>>>>>>>
>>>>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>>>>>> port along the path to A.
>>>>>>>
>>>>>>> Yes, modulo this comment in libvirt source:
>>>>>>>
>>>>>>> /* if we have no parent, and this is the root bus, ACS doesn't come
>>>>>>> * into play since devices on the root bus can't P2P without going
>>>>>>> * through the root IOMMU.
>>>>>>> */
>>>>>>>
>>>>>> Correct. PCIe spec says roots must support ACS. I believe all the
>>>>>> root bridges that have an IOMMU have ACS wired in/on.
>>>>>
>>>>> Would you mind looking for the paragraph that says this? I'd rather
>>>>> code this into the iommu driver callers than core PCI code if this is
>>>>> just a platform standard.
>>>>>
>>>> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
>>>> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>>>> Redirected Request Validation;
>>>
>>> Huh? (If support ACS.RR then must support ACS.UF) != must support ACS.
>>>
>> UF?
>
> Upstream Forwarding. So basically that spec quote is saying that if the
> RC supports one aspect of ACS then it must support this other. But that
> doesn't say to me that it must support ACS to begin with.
>
It says if RC supports peer-to-peer btwn root ports, the root ports must
support ACS.... at least that's how I understood it.
but, as we've seen, there's spec, then there's reality...
>>>> -- which means, if a Root port allows a peer-to-peer transaction to another
>>>> one of its ports, then it has to support ACS.
>>>
>>> I don't get that at all from 6.12.1.1, especially given the first
>>> sentence of that section:
>>>
>>> This section applies to Root Ports and Downstream Switch Ports
>>> that implement an ACS Extended Capability structure.
>>>
>>>
>> hmmm, well I did. The root port section is different from the Downstream
>> port section as well. Downstream ports *must* support peer-xfers due to
>> positive decoding of base/limit addresses, and ACS is optional in downstream ports.
>> Peer-to-peer btwn root ports is optional.
>> so, I don't get what you don't get... ;-)
>> .. but, I understand how the spec can be read/interpreted differently,
>> given its clarity (did lawyers write it?!?!!), so I could be interpreting it
>> incorrectly.
>>
>>>> So, this means that:
>>>> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>>>> ACS isn't needed
>>>> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>>>> it must have ACS capability...
>>>> and since the linux code turns on ACS for all ports with an ACS cap,
>>>> it follows (in Linux) that all Root ports end up implementing the
>>>> end functionality of ACS==on: all traffic goes up to the IOMMU in the RC.
>>>>
>> I thought I explained how I interpreted the root-port part of ACS above,
>> so maybe you can tell me how you think my interpretation is incorrect.
>
> I don't know that it's incorrect, but I don't see how you're making the
> leap you are. I think the version I have now of the patch leaves this
> nicely to the iommu drivers. It would make sense, not that hardware
> always makes sense, that anything with an iommu is going to hardwire
> translations through the iommu when enabled or they couldn't reasonably
> do any kind of peer-to-peer with the iommu.
>
Agreed. The version you have looks good,
and avoids/handles potential ACS cap idiosyncrasies in RCs.
>>>>>>> So we assume that a redirect at the point of the iommu will factor in
>>>>>>> iommu translation.
>>>>>>>
>>>>>>>> If so, I think you're trying to find out the closest upstream device X
>>>>>>>> such that everything leading to X has ACS enabled. Every device below
>>>>>>>> X can DMA freely to other devices below X, so they would all have to
>>>>>>>> be in the same isolated group.
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> I tried to work through some examples to develop some intuition about this:
>>>>>>>
>>>>>>> (inserting fixed url)
>>>>>>>> http://www.asciiflow.com/#3736558963405980039
>>>>>>>
>>>>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>>>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>>>>>
>>>>>>> Hmm, the latter is the assumption above. For the former, I think
>>>>>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>>>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>>>>>> except on PCIe. I'll need to work on making that more generic.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>>>>>> PCIe; seems wrong)
>>>>>>>
>>>>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>>>>>> input devices, so this was passing for me. I'll need to incorporate
>>>>>>> that generically.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>>>>>> doesn't have ACS)
>>>>>>>
>>>>>>> Yeah, let me validate the libvirt assumption. I see ACS on my root
>>>>>>> port, so maybe they're just assuming it's always enabled or that the
>>>>>>> precedence favors IOMMU translation. I'm also starting to think that we
>>>>>>> might want "from" and "to" struct pci_dev parameters to make it more
>>>>>>> flexible where the iommu lives in the system.
>>>>>>>
>>>>>> see comment above wrt root ports that have IOMMUs in them.
>>>>>
>>>>> Except it really seems to be platform convention where the IOMMU lives.
>>>>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
>>>>> for and from that we can make assumptions about where it physically
>>>>> lives. AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
>>>>> function on the root bus. For now I'm just allowing
>>>>> pci_acs_path_enabled to take NULL for an end, which means "up to the
>>>>> root bus".
>>>>>
>>>> ATM, VT-d IOMMUs are only in the RCs, so ACS at each downstream port
>>>> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
>>>> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
>>>> yet-another spec (or ask Joerg to check in on this thread & provide the details).
>>>
>>> But are we programming to convention or spec? And I'm still confused
>>> about why we assume the root port isn't susceptible to redirection
>>> before IOMMU translation. One of the benefits of the
>>> pci_acs_path_enabled() API is that it pushes convention out to the IOMMU
>>> driver. So it's intel-iommu.c's problem whether to test for ACS support
>>> to the RC or to a given level (and for that matter know whether IOMMU
>>> translation takes precedence over redirection in the RC).
>>>
>> Agreed. The best design here would be for intel-iommu.c to test for ACS
>> wrt the root port (not to be confused with the root complex) since Intel-IOMMUs
>> are not 'attached' to a PCI bus. Now, AMD's IOMMUs are attached to a PCI bus,
>> so maybe it can be simplified in that case.... but it may be best to mimic
>> the iommu-set-acs-for-root-port check in the same manner.
>>
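
i.e., roughly something like this in the iommu driver -- REQ_ACS_FLAGS is a
made-up name for whichever ACS bits the driver decides to require:

#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

static bool pdev_path_isolated(struct pci_dev *pdev)
{
        /* NULL end == "test ACS all the way up to the root bus" */
        return pci_acs_path_enabled(pdev, NULL, REQ_ACS_FLAGS);
}
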
>>>>>>
>>>>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>>>>>
>>>>>>> AIUI, ACS is not an endpoint property, so this is what should happen. I
>>>>>>> don't think multifunction plays a role other than how much do we trust
>>>>>>> the implementation to not allow back channels between functions (the
>>>>>>> answer should probably be not at all).
>>>>>>>
>>>>>> correct. ACS is a *bridge* property.
>>>>>> The unknown wrt multifunction devices is that such devices *could* be implemented
>>>>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>>>>>> btwn the functions within a device.
>>>>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>>>>>> force ACS. So, one has to ask the hw vendors if such a hidden device exists
>>>>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>>>>>> bridge/PCIe-switch could just be hardwired to push all IO to the upstream port,
>>>>>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
>>>>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>>>>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>>>>>> determine this status about a device (via pci cfg/cap space).
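
So the conservative test ends up being something like this (sketch only,
the helper name is invented):

/* With no ACS capability to inspect, conservatively assume a
 * multifunction device may allow function-to-function peer traffic.
 */
static bool mfd_may_p2p(struct pci_dev *pdev)
{
        return pdev->multifunction &&
               !pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
}
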
>>>>>
>>>>> Well, there is actually a section of the ACS part of the spec
>>>>> identifying valid flags for multifunction devices. Secretly I'd like to
>>>>> use this as justification for blacklisting all multifunction devices
>>>>> that don't explicitly support ACS, but that makes for pretty coarse
>>>>> granularity. For instance, all these devices end up in a group:
>>>>>
>>>>> +-14.0 ATI Technologies Inc SBx00 SMBus Controller
>>>>> +-14.2 ATI Technologies Inc SBx00 Azalia (Intel HDA)
>>>>> +-14.3 ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>>>>> +-14.4-[05]----0e.0 VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>>>>>
>>>>> 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>>>>>
>>>>> And these in another:
>>>>>
>>>>> +-15.0-[06]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>>>>> +-15.1-[07]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>>>>> +-15.2-[08]--
>>>>> +-15.3-[09]--
>>>>>
>>>>> 00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>>>>> 00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>>>>> 00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>>>>> 00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>>>>>
>>>>> Am I misinterpreting the spec or is this the correct, if strict,
>>>>> interpretation?
>>>>>
>>>>> Alex
>>>>>
>>>> Well, in digging into the ACS-support in Root port question above,
>>>> I just found out about this ACS support status for multifunctions too.
>>>> I need more time to read/digest, but a quick read says MFDs should have
>>>> an ACS cap with relevant RO status & control bits to direct/control
>>>> peer-to-peer and ACS upstream forwarding. Lack of an ACS struct implies(?)
>>>> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
>>>> Unfortunately, I think a large number of MFDs don't have ACS caps,
>>>> and don't/can't do peer-to-peer, so the above is heavy-handed,
>>>> albeit to spec.
>>>> Maybe we need a (large?) pci-quirk for the list of existing
>>>> MFDs that don't have ACS caps that would enable the above devices
>>>> to be in separate groups.
>>>> On the flip side, it solves some of the quirks for MFDs that
>>>> use the wrong BDF in their src-id dma packets! :) -- they default
>>>> to the same group now...
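
Such a quirk could start out as a simple device-id table that short-circuits
the ACS test -- the table and function names below are invented for
illustration:

/* MFDs known not to do internal peer-to-peer despite lacking an ACS cap */
static const struct pci_device_id mfd_isolation_quirks[] = {
        /* { PCI_DEVICE(vendor, device) }, -- entries added as validated */
        { 0 }
};

static bool pci_mfd_isolated(struct pci_dev *pdev)
{
        return pci_match_id(mfd_isolation_quirks, pdev) != NULL;
}
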
>>>
>>> Yep, sounds like you might agree with my patch, it's heavy-handed, but
>> Yes, but we should design-in a quirk check list for MFDs,
>> b/c we already know some will fail when they should pass this check,
>> b/c the hw was made pre-ACS, or the vendors didn't see the benefit
>> of ACS reporting (even if the functionality was enabling bit-settings
>> that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
>> dev-assignment to virt guests, etc.).
>>
>>> seems to adhere to the spec. That probably just means we need an option
>>> to allow a more lenient interpretation, that maybe we don't have to
>>> support. Thanks,
>> Right. As I said, a hook to do quirk-level additions from the get-go
>> would speed up handling this expected need.
>
> Ok, a quirk should be easy there. Thanks,
>
> Alex
>