lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Wed, 08 Aug 2012 15:33:00 -0400
From:	Don Dutile <ddutile@...hat.com>
To:	Alex Williamson <alex.williamson@...hat.com>
CC:	Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, dsahern@...il.com
Subject: Re: [PATCH] pci: Account for virtual buses in pci_acs_path_enabled

On 08/08/2012 11:24 AM, Alex Williamson wrote:
> On Tue, 2012-08-07 at 23:00 -0700, Bjorn Helgaas wrote:
>> On Tue, Aug 7, 2012 at 2:50 PM, Don Dutile<ddutile@...hat.com>  wrote:
>>> On 08/06/2012 04:47 PM, Bjorn Helgaas wrote:
>>>>
>>>> On Sun, Aug 5, 2012 at 11:55 PM, Alex Williamson
>>>> <alex.williamson@...hat.com>   wrote:
>>>>>
>>>>> On Sun, 2012-08-05 at 23:30 -0600, Bjorn Helgaas wrote:
>>>>>>
>>>>>> On Sat, Aug 4, 2012 at 12:19 PM, Alex Williamson
>>>>>> <alex.williamson@...hat.com>   wrote:
>>>>>>>
>>>>>>> It's possible to have buses without an associated bridge
>>>>>>> (bus->self == NULL).  SR-IOV can generate such buses.  When
>>>>>>> we find these, skip to the parent bus to look for the next
>>>>>>> ACS test.
>>>>>>
>>>>>>
>>>>>> To make sure I understand the problem here, I think you're referring
>>>>>> to the situation where an SR-IOV device can span several bus numbers,
>>>>>> e.g., the "VFs Spanning Multiple Bus Numbers" implementation note in
>>>>>> the SR-IOV 1.1 spec, sec. 2.1.2.
>>>>>>
>>>>>> It says "All PFs must be located on the Device's captured Bus Number"
>>>>>> -- I think that means every PF will be directly on a bridge's
>>>>>> secondary bus and hence will have a valid dev->bus->self pointer.
>>>>>>
>>>>>> However, VFs need not be on the same bus number.  If a VF is on
>>>>>> (captured Bus Number plus 1), I think we allocate a new struct pci_bus
>>>>>> for it, but there's no P2P bridge that leads to that bus, so the
>>>>>> bus->self pointer is probably NULL.
>>>>>
>>>>>
>>>>> Yes, exactly.  virtfn_add_bus() is where we're creating this new bus.
>>>>>
>>>>>> This makes me quite nervous, because I bet there are many places that
>>>>>> assume every non-root bus has a valid bus->self pointer  -- I know I
>>>>>> certainly had that assumption.
>>>>>>
>>>>>> I looked at callers of pci_is_root_bus(), and at first glance, it seems
>>>>>> like
>>>>>> iommu_init_device(), intel_iommu_add_device(), pci_acs_path_enabled(),
>>>>>
>>>>>
>>>>>
>>>>> These 3 are handled by this patch, plus the intel and amd iommu patches
>>>>> I sent.
>>>>>
>>>>>> pci_get_interrupt_pin(), pci_common_swizzle(),
>>>>>
>>>>>
>>>>> If sr-iov is the only source of these virtual buses, these are probably
>>>>> ok since VFs don't support INTx.
>>>>>
>>>>>> pci_find_upstream_pcie_bridge(), and
>>>>>
>>>>>
>>>>> Here the pci_is_root_bus() is after a pci_is_pcie() check, so again if
>>>>> sr-iov only (and assuming VFs properly report PCIe capability), we
>>>>> shouldn't stumble on it.
>>>>>
>>>>>> pci_bus_release_bridge_resources() all might have similar problems.
>>>>>
>>>>>
>>>>> This one might deserve further investigation.  Thanks,
>>>>
>>>>
>>>> We can fix all these places piecemeal, but that doesn't feel like a
>>>> very satisfying solution.  It makes it much harder to know that each
>>>> place is correct, and this oddity of a bus with no upstream bridge is
>>>> still lying around, waiting to bite us again later.
>>>>
>>>> What other possible ways of fixing this do we have?  Could we set
>>>> bus->self (multiple buses would then point to the same bridge, and I
>>>> don't know if that would break something)?  Add something like a
>>>> pci_upstream_p2p_bridge() interface that would encapsulate traversing
>>>
>>>    ^^^ and this name will reduce the confusion? :)
>>
>> I don't claim that :)  I just wanted to explore other possible
>> solutions.  Changing every loop that searches the parent chain so it
>> knows about this SR-IOV oddity doesn't seem like the ideal solution,
>> though maybe it's the best we can do given the constraints.
>
> Yep, these are the two alternatives I thought of too, set bus->self or
> create a helper function.  The former leaves present assumptions in
> place, but I don't know whether we'll break something else having two
> buses claiming to be sourced by the same device.  Maintaining the NULL
> self pointer almost feels like a better representation of the hardware,
> but requires evaluating all the current users and coming up with a
> helper.  That's more work, but we probably want to move away from
> everyone manually walking pci data structures anyway.
>
>>>> Since these fake VF buses don't have a bridge that points to them, I
>>>
>>> Well, they aren't fake busses, just ARI-identifiers, which translate the
>>> B:D.F/8:5.3
>>> format to simply a 16-bit i.d.
>>
>> I think an SR-IOV device can consume multiple bus numbers even without
>> ARI (in fact, I think ARI  reduces the number of bus numbers the
>> device requires ... e.g., a PF and 15 VFs would require two bus
>> numbers without ARI (04:00.0 - 04:00.7 and 05:00.0 - 05:00.7) but only
>> one bus number with ARI (04:00.0 - 04:01.7)).  (I think "04:01.7" is
>> how Linux would represent the 8-bit function number ARI gives you.
>> You could also think of it as "04:00.0f")
>
> Yep, that's what I think is happening too.  I've been testing on a
> system with ARI, so using the same device as David, PFs at 1:00.0/1 and
> VFs at 1:10.0 - 1:11.5.  The sr-iov capability reports VF offset: 128,
> stride: 2.  IIRC, w/o ARI this same device reports VF offset 384 (256 +
> 128), which seems to match David's lspci.
>

Correct, lack of ARI forces the use of more bus numbers, not less.
It is 'fixed' by the iov.c code by increasing the sub_bus num register
to cover the range of busses the VF stride wants to use, if those bus-num's
aren't already consumed; if consumed, the VFs are inaccessible/unconfigurable.

>>
>>> So, VF devices should be attached to same bus->devices list as it's PF.
>>
>> I don't think it works that way today, does it?  In the SR-IOV spec
>> example in sec 2.1.2:
>>
>>    PF 0 at 04:00.0
>>    ARI Capable is set
>>    First VF Offset = 1, VF Stride = 1, NumVFs = 600
>>
>> I think we have three separate bus->devices lists:
>>
>>    pci_bus 04: devices list contains PF 0 and VF 0,1 through VF 0,255
>>    pci_bus 05: devices list contains VF 0,256 - VF 0,511
>>    pci_bus 06: devices list contains VF 0,512 - VF 0,600
>>
>>> pci_dev->bus should be same bus ptr as PF's pci_dev as well, since the
>>> VF uses all that's busses resources, support functions (cfg, dma-ops, etc.)
>>> as well.
>>> Searching the driver/pci area, support of functions like AER want the
>>> bus struct that's receiving/handling the PCIe error, associated (hw) port,
>>> etc.,
>>> so another reason the VF's pci-dev bus ptr should be the same as the PF's.
>>
>> Maybe every VF *should* have the same dev->bus pointer as the PF, but
>> I don't think it does today.  I think we only store the bus number in
>> the struct pci_bus, so if we *did* give all the VFs the same dev->bus
>> pointer and put all the VFs in the same bus->devices list, we'd have
>> to store the bus number elsewhere, e.g., in the struct pci_dev.
>>
>> That might make sense, but the magnitude of a change like that makes
>> my head hurt --  it would affect drivers, arch code, config accessors,
>> etc.
>
> Yep, I agree, that's a total rework of what a struct pci_bus represents.
> Thanks,
>
> Alex
>

To sum up, I believe the sw structures should reflect the hw connectivity
and association.  If you buy that bridge (all pun intended), then the VFs should
get connected to same pci-bus as the PF b/c it physically is, and the register
set for that parent bridge is the one the PF is connected to ...
bus->self == NULL will cause heartburn for AERs associated with VFs .... no dev to
point to in order to do AER handling ... :(
Likewise, additional bus structs could be generated, but they would require
ptrs in them to ensure they point to the same structures (same dev-lists, same
resource structs, etc.) and/or inherit the PF fields/attributes.
Changes to a pci-dev associated with a bridge will also have to check if
'peer' (VF/SRIOV) busses are associated with that bridge/pci-dev, and duplicate
updates -- which begs for the same bus struct to be used.

I think we have to break the identification of devices, B:D.F (VFs specifically)
and the existing 1-to-1 relationship with a hw device.   Helper functions
to change VF B:D.F formatting to be PF-B:[potentially->-32-D]:F would be one way to do it.
Changing devfn in pci-dev struct from u8 to u16 would one option;
putting the entire B:D.F into pci-dev and making it u32 would probably be the cleanest
(and I'm familiar with at least one OS & it's PCI developer that did just that ;-),
  but not for these ARI reasons), and then macros like PCI_FUNC() and PCI_SLOT()
could be changed appropriately, as well as a new PCI_BUS() macro.
The only serious gotcha I see in the B:D.F interpretation is how it is presented in
sysfs, since a D value >32 could cause some heartburn for user apps traversing
sysfs, and how those values would get copy / transposed to other interaces,
e.g., libvirt scanning sysfs for VFs, and how the B:D.F is parsed and passed
down to qemu when a device is assigned to a guest.  The 32-bit B:D.F in pci-dev,
and new helpers to present & parse based on that field are looking better to me now...

As for PCI tree traversal, endpoints shouldn't be doing it; I'll bet the majority
that do, are looking for the parent bus to get/print the bus-num;
That leaves PCI core code for bus scanning, hot plug, ACS tree traversal, etc.,
which should be manageable in a change like the ones I mention in the previous paragraph.
Alex: does VFIO use pci functions to do traversal, or does it implement its own
       functions to travers pci trees?

So, as stated previously, this certainly hurts the brain cells, but not as much
as I thought it would.
Someday, I'll try to figure out how mr-iov-based busses come & go in the pci dev tree! :-o !

- Don


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ