[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <54193e39-e13d-a718-9194-79688e1ccde2@linux.vnet.ibm.com>
Date: Thu, 30 Jun 2016 10:40:23 +0800
From: Yongji Xie <xyjxie@...ux.vnet.ibm.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
bhelgaas@...gle.com, aik@...abs.ru, benh@...nel.crashing.org,
paulus@...ba.org, mpe@...erman.id.au, warrier@...ux.vnet.ibm.com,
zhong@...ux.vnet.ibm.com, nikunj@...ux.vnet.ibm.com,
gwshan@...ux.vnet.ibm.com, kevin.tian@...el.com
Subject: Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio
page is exclusive
Hi Alex,
On 2016/6/30 4:03, Alex Williamson wrote:
> On Tue, 28 Jun 2016 13:47:23 -0600
> Alex Williamson <alex.williamson@...hat.com> wrote:
>
>> On Tue, 28 Jun 2016 18:09:46 +0800
>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>
>>> Hi, Alex
>>>
>>> On 2016/6/25 0:43, Alex Williamson wrote:
>>>
>>>> On Fri, 24 Jun 2016 23:37:02 +0800
>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>
>>>>> Hi, Alex
>>>>>
>>>>> On 2016/6/24 11:37, Alex Williamson wrote:
>>>>>
>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
>>>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>>>>>> +{
>>>>>>>>> + struct resource *res;
>>>>>>>>> + int bar;
>>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res;
>>>>>>>>> +
>>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>>>>>>> +
>>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>>>>>>> + res = vdev->pdev->resource + bar;
>>>>>>>>> +
>>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>>>>>>> + goto no_mmap;
>>>>>>>>> +
>>>>>>>>> + if (!(res->flags & IORESOURCE_MEM))
>>>>>>>>> + goto no_mmap;
>>>>>>>>> +
>>>>>>>>> + /*
>>>>>>>>> + * The PCI core shouldn't set up a resource with a
>>>>>>>>> + * type but zero size. But there may be bugs that
>>>>>>>>> + * cause us to do that.
>>>>>>>>> + */
>>>>>>>>> + if (!resource_size(res))
>>>>>>>>> + goto no_mmap;
>>>>>>>>> +
>>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) {
>>>>>>>>> + vdev->bar_mmap_supported[bar] = true;
>>>>>>>>> + continue;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + if (!(res->start & ~PAGE_MASK)) {
>>>>>>>>> + /*
>>>>>>>>> + * Add a dummy resource to reserve the remainder
>>>>>>>>> + * of the exclusive page in case that hot-add
>>>>>>>>> + * device's bar is assigned into it.
>>>>>>>>> + */
>>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>>>>>>> + if (dummy_res == NULL)
>>>>>>>>> + goto no_mmap;
>>>>>>>>> +
>>>>>>>>> + dummy_res->resource.start = res->end + 1;
>>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>>>>>>> + dummy_res->resource.flags = res->flags;
>>>>>>>>> + if (request_resource(res->parent,
>>>>>>>>> + &dummy_res->resource)) {
>>>>>>>>> + kfree(dummy_res);
>>>>>>>>> + goto no_mmap;
>>>>>>>>> + }
>>>>>>>> Isn't it true that request_resource() only tells us that at a given
>>>>>>>> point in time, no other drivers have reserved that resource? It seems
>>>>>>>> like it does not guarantee that the resource isn't routed to another
>>>>>>>> device or that another driver won't at some point attempt to request
>>>>>>>> that same resource. So for example if a user constructs their initrd
>>>>>>>> to bind vfio-pci to devices before other modules load, this
>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
>>>>>>>> now failing. The behavior will depend on driver load order and we're
>>>>>>>> not actually insuring that the overflow resource is unused, just that
>>>>>>>> we got it first. Can we do better? Am I missing something that
>>>>>>>> prevents this? Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> Couldn't PCI resources allocator prevent this, which will find a
>>>>>>> empty slot in the resource tree firstly, then try to request that
>>>>>>> resource in allocate_resource() when a PCI device is probed.
>>>>>>> And I'd like to know why a PCI device driver would attempt to
>>>>>>> call request_resource()? Should this be done in PCI enumeration?
>>>>>> Hi Yongji,
>>>>>>
>>>>>> Looks like most pci drivers call pci_request_regions(). From there the
>>>>>> call path is:
>>>>>>
>>>>>> pci_request_selected_regions
>>>>>> __pci_request_selected_regions
>>>>>> __pci_request_region
>>>>>> __request_mem_region
>>>>>> __request_region
>>>>>> __request_resource
>>>>>>
>>>>>> We see this driver ordering issue sometimes with users attempting to
>>>>>> blacklist native pci drivers, trying to leave a device free for use by
>>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi
>>>>>> driver can request device resources causing a failure when vfio-pci
>>>>>> tries to request those same resources. I expect that unless it's a
>>>>>> boot device, like vga in my example, the resources are not enabled
>>>>>> until the driver opens the device, therefore the request_resource() call
>>>>>> doesn't occur until that point.
>>>>>>
>>>>>> For another trivial example, look at /proc/iomem as you load and unload
>>>>>> a driver, on my laptop with e1000e unloaded I see:
>>>>>>
>>>>>> e1200000-e121ffff : 0000:00:19.0
>>>>>> e123e000-e123efff : 0000:00:19.0
>>>>>>
>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
>>>>>> driver:
>>>>>>
>>>>>> e1200000-e121ffff : 0000:00:19.0
>>>>>> e1200000-e121ffff : e1000e
>>>>>> e123e000-e123efff : 0000:00:19.0
>>>>>> e123e000-e123efff : e1000e
>>>>>>
>>>>>> Clearly pci core knows the resource is associated with the device, but
>>>>>> I don't think we're tapping into that with request_resource(), we're
>>>>>> just potentially stealing resources that another driver might have
>>>>>> claimed otherwise as I described above. That's my suspicion at
>>>>>> least, feel free to show otherwise if it's incorrect. Thanks,
>>>>>>
>>>>>> Alex
>>>>>>
>>>>> Thanks for your explanation. But I still have one question.
>>>>> Shouldn't PCI core have claimed all PCI device's resources
>>>>> after probing those devices. If so, request_resource() will fail
>>>>> when vfio-pci try to steal resources that another driver might
>>>>> request later. Anything I missed here? Some device resources
>>>>> would not be claimed in PCI core?
>>>> Hi Yongji,
>>>>
>>>> I don't know what to say, this is not how the interface currently
>>>> works. request_resource() is a driver level interface that tries to
>>>> prevent drivers from claiming conflicting resources. In this patch
>>>> you're trying to use it to probe whether a resource maps to another
>>>> device. This is not what it does. request_resource() will happily let
>>>> you claim any resource you want, so long as nobody else claimed it
>>>> first. So the only case where the assumptions in this patch are valid
>>>> is if we can guarantee that any potentially conflicting device has a
>>>> driver loaded that has claimed those resources. As it is here,
>>>> vfio-pci will happily attempt to request resources that might overlap
>>>> with another device and might break drivers that haven't yet had a
>>>> chance to probe their devices. I don't think that's acceptable.
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>> I'm trying to get your point. Let me give an example here.
>>> I'm not sure whether my understanding is right. Please
>>> point it out if I'm wrong.
>>>
>>> We assume that there are two PCI devices like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>> 240000000000-2400ffffffff : PCI Bus 0002:01
>>> 240000000000-240000007fff : 0002:01:00.0
>>> 240000000000-240000007fff : vfio-pci
>>> 240000008000-24000000ffff : 0002:01:01.0
>>> 240000008000-24000000ffff : lpfc
>>>
>>> Do you mean vfio-pci driver will succeed in requesting
>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
>>> if it is loaded before lpfc driver? Like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>> 240000000000-2400ffffffff : PCI Bus 0002:01
>>> 240000000000-240000007fff : 0002:01:00.0
>>> 240000000000-240000007fff : vfio-pci
>>> 240000008000-24000000ffff : 0002:01:01.0
>>> 240000008000-24000000ffff : <BAD> --> vfio-pci call
>>> request_resource()
>>>
>>> Then lpfc driver will fail when it attempts to call
>>> pci_request_regions() later.
>> Yes, that is my supposition.
>>
>>> But is it possible that the dummy_res become the child of
>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
>>> it found parent res: PCI Bus 0002:01 already have conflict
>>> child res: 0002:01:01.0.
>>>
>>> And I think the case that request_resource() will succeed
>>> should like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>> 240000000000-2400ffffffff : PCI Bus 0002:01
>>> 240000000000-240000007fff : 0002:01:00.0
>>> 240000010000-240000017fff : 0002:01:01.0
>>>
>>> There is a mem hole: [240000008000-24000000ffff] after
>>> PCI probing. After loading drivers, the resources tree
>>> will be:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>> 240000000000-2400ffffffff : PCI Bus 0002:01
>>> 240000000000-240000007fff : 0002:01:00.0
>>> 240000000000-240000007fff : vfio-pci
>>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call
>>> request_resource()
>>> 240000010000-240000017fff : 0002:01:01.0
>>> 240000010000-240000017fff : lpfc
>> Ok, let's stop guessing about this. I modified your patch as follows
>> so I could easily test it on a 4k host:
>>
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>> }
>>
>> +#define VFIO_64K_PAGE_SIZE (64*1024)
>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
>> +
>> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>> {
>> struct resource *res;
>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>> if (!resource_size(res))
>> goto no_mmap;
>>
>> - if (resource_size(res) >= PAGE_SIZE) {
>> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>> vdev->bar_mmap_supported[bar] = true;
>> continue;
>> }
>>
>> - if (!(res->start & ~PAGE_MASK)) {
>> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
>> + int ret;
>> /*
>> * Add a dummy resource to reserve the remainder
>> * of the exclusive page in case that hot-add
>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>> goto no_mmap;
>>
>> dummy_res->resource.start = res->end + 1;
>> - dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>> dummy_res->resource.flags = res->flags;
>> - if (request_resource(res->parent,
>> - &dummy_res->resource)) {
>> + ret = request_resource(res->parent,
>> + &dummy_res->resource);
>> + if (ret) {
>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>> kfree(dummy_res);
>> goto no_mmap;
>> }
>>
>> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a
>> system configuration to test it:
>>
>> ee400000-ef4fffff : PCI Bus 0000:07
>> ef480000-ef49ffff : 0000:07:00.0
>> ef480000-ef483fff : 0000:08:10.0
>> ef484000-ef487fff : 0000:08:10.2
>> ef488000-ef48bfff : 0000:08:10.4
>> ef48c000-ef48ffff : 0000:08:10.6
>> ef490000-ef493fff : 0000:08:11.0
>> ef494000-ef497fff : 0000:08:11.2
>> ef498000-ef49bfff : 0000:08:11.4
>> ef4a0000-ef4bffff : 0000:07:00.0
>> ef4a0000-ef4a3fff : 0000:08:10.0
>> ef4a4000-ef4a7fff : 0000:08:10.2
>> ef4a8000-ef4abfff : 0000:08:10.4
>> ef4ac000-ef4affff : 0000:08:10.6
>> ef4b0000-ef4b3fff : 0000:08:11.0
>> ef4b4000-ef4b7fff : 0000:08:11.2
>> ef4b8000-ef4bbfff : 0000:08:11.4
>>
>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
>> separate contiguous range. The igbvf driver is not loaded, so the
>> other resources are free to be claimed, what happens?
>>
>> It looks like you're right, the request_resource() fails with:
>>
>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>>
>> So we get back -EBUSY which means we hit a conflict. I would have
>> expected that this means our res->parent that we're using for
>> request_resource() is only, for instance, ef480000-ef483fff (ie. the
>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
>> the parent, but adding the parent resource range to the dev_info(), it
>> actually shows the range being ef480000-ef49ffff, so the parent is
>> actually the 07:00.0 resource. In fact, we can't even use
>> request_resource() like this to claim the BAR itself, which is why we
>> use pci_request_selected_regions(), which allows conflicts, putting the
>> requested resource at the leaf of the tree.
>>
>> So I guess I retract this concern about the use of request_resource(),
>> it does seem to behave as intended. Unless I can spot anything else or
>> other reviewers have comments, I'll queue this into my next branch for
>> v4.8. Thanks,
>
> Ok, one more test, I found that I have access to the following USB
> devices:
>
> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
> Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
>
> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
> Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
>
> These are nicely mapped such that vfio-pci can claim the residual space
> from the page, which results in the following in /proc/iomem:
>
> f7a07000-f7a073ff : 0000:00:1d.0
> f7a07000-f7a073ff : vfio
> f7a07400-f7a07fff : <BAD>
> f7a08000-f7a083ff : 0000:00:1a.0
> f7a08000-f7a083ff : vfio
> f7a08400-f7a08fff : <BAD>
>
> I should have looked more closely at your previous reply, I didn't
> think that "<BAD>" was literally the owner of these dummy resources.
> Please fix this to report something that isn't going frighten users
> and make small children cry. Thanks,
Yeah, I also noticed that:-). Now I'm trying to find a proper
name. What do you think about "vfio-pci (dummy)"?
Thanks,
Yongji
Powered by blists - more mailing lists