linux-kernel - Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <54193e39-e13d-a718-9194-79688e1ccde2@linux.vnet.ibm.com>
Date:	Thu, 30 Jun 2016 10:40:23 +0800
From:	Yongji Xie <xyjxie@...ux.vnet.ibm.com>
To:	Alex Williamson <alex.williamson@...hat.com>
Cc:	kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
	bhelgaas@...gle.com, aik@...abs.ru, benh@...nel.crashing.org,
	paulus@...ba.org, mpe@...erman.id.au, warrier@...ux.vnet.ibm.com,
	zhong@...ux.vnet.ibm.com, nikunj@...ux.vnet.ibm.com,
	gwshan@...ux.vnet.ibm.com, kevin.tian@...el.com
Subject: Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio
 page is exclusive

Hi Alex,

On 2016/6/30 4:03, Alex Williamson wrote:

> On Tue, 28 Jun 2016 13:47:23 -0600
> Alex Williamson <alex.williamson@...hat.com> wrote:
>
>> On Tue, 28 Jun 2016 18:09:46 +0800
>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>
>>> Hi, Alex
>>>
>>> On 2016/6/25 0:43, Alex Williamson wrote:
>>>    
>>>> On Fri, 24 Jun 2016 23:37:02 +0800
>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>     
>>>>> Hi, Alex
>>>>>
>>>>> On 2016/6/24 11:37, Alex Williamson wrote:
>>>>>     
>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
>>>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>>>>>> +{
>>>>>>>>> +	struct resource *res;
>>>>>>>>> +	int bar;
>>>>>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
>>>>>>>>> +
>>>>>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>>>>>>> +
>>>>>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>>>>>>> +		res = vdev->pdev->resource + bar;
>>>>>>>>> +
>>>>>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		if (!(res->flags & IORESOURCE_MEM))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		/*
>>>>>>>>> +		 * The PCI core shouldn't set up a resource with a
>>>>>>>>> +		 * type but zero size. But there may be bugs that
>>>>>>>>> +		 * cause us to do that.
>>>>>>>>> +		 */
>>>>>>>>> +		if (!resource_size(res))
>>>>>>>>> +			goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
>>>>>>>>> +			vdev->bar_mmap_supported[bar] = true;
>>>>>>>>> +			continue;
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		if (!(res->start & ~PAGE_MASK)) {
>>>>>>>>> +			/*
>>>>>>>>> +			 * Add a dummy resource to reserve the remainder
>>>>>>>>> +			 * of the exclusive page in case that hot-add
>>>>>>>>> +			 * device's bar is assigned into it.
>>>>>>>>> +			 */
>>>>>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>>>>>>> +			if (dummy_res == NULL)
>>>>>>>>> +				goto no_mmap;
>>>>>>>>> +
>>>>>>>>> +			dummy_res->resource.start = res->end + 1;
>>>>>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>>>>>>> +			dummy_res->resource.flags = res->flags;
>>>>>>>>> +			if (request_resource(res->parent,
>>>>>>>>> +						&dummy_res->resource)) {
>>>>>>>>> +				kfree(dummy_res);
>>>>>>>>> +				goto no_mmap;
>>>>>>>>> +			}
>>>>>>>> Isn't it true that request_resource() only tells us that at a given
>>>>>>>> point in time, no other drivers have reserved that resource?  It seems
>>>>>>>> like it does not guarantee that the resource isn't routed to another
>>>>>>>> device or that another driver won't at some point attempt to request
>>>>>>>> that same resource.  So for example if a user constructs their initrd
>>>>>>>> to bind vfio-pci to devices before other modules load, this
>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
>>>>>>>> now failing.  The behavior will depend on driver load order and we're
>>>>>>>> not actually insuring that the overflow resource is unused, just that
>>>>>>>> we got it first.  Can we do better?  Am I missing something that
>>>>>>>> prevents this?  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> Couldn't PCI resources allocator prevent this, which will find a
>>>>>>> empty slot in the resource tree firstly, then try to request that
>>>>>>> resource in allocate_resource() when a PCI device is probed.
>>>>>>> And I'd like to know why a PCI device driver would attempt to
>>>>>>> call request_resource()? Should this be done in PCI enumeration?
>>>>>> Hi Yongji,
>>>>>>
>>>>>> Looks like most pci drivers call pci_request_regions().  From there the
>>>>>> call path is:
>>>>>>
>>>>>> pci_request_selected_regions
>>>>>>      __pci_request_selected_regions
>>>>>>        __pci_request_region
>>>>>>          __request_mem_region
>>>>>>            __request_region
>>>>>>              __request_resource
>>>>>>
>>>>>> We see this driver ordering issue sometimes with users attempting to
>>>>>> blacklist native pci drivers, trying to leave a device free for use by
>>>>>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
>>>>>> driver can request device resources causing a failure when vfio-pci
>>>>>> tries to request those same resources.  I expect that unless it's a
>>>>>> boot device, like vga in my example, the resources are not enabled
>>>>>> until the driver opens the device, therefore the request_resource() call
>>>>>> doesn't occur until that point.
>>>>>>
>>>>>> For another trivial example, look at /proc/iomem as you load and unload
>>>>>> a driver, on my laptop with e1000e unloaded I see:
>>>>>>
>>>>>>      e1200000-e121ffff : 0000:00:19.0
>>>>>>      e123e000-e123efff : 0000:00:19.0
>>>>>>
>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
>>>>>> driver:
>>>>>>
>>>>>>      e1200000-e121ffff : 0000:00:19.0
>>>>>>        e1200000-e121ffff : e1000e
>>>>>>      e123e000-e123efff : 0000:00:19.0
>>>>>>        e123e000-e123efff : e1000e
>>>>>>
>>>>>> Clearly pci core knows the resource is associated with the device, but
>>>>>> I don't think we're tapping into that with request_resource(), we're
>>>>>> just potentially stealing resources that another driver might have
>>>>>> claimed otherwise as I described above.  That's my suspicion at
>>>>>> least, feel free to show otherwise if it's incorrect.  Thanks,
>>>>>>
>>>>>> Alex
>>>>>>        
>>>>> Thanks for your explanation. But I still have one question.
>>>>> Shouldn't PCI core have claimed all PCI device's resources
>>>>> after probing those devices. If so, request_resource() will fail
>>>>> when vfio-pci try to steal resources that another driver might
>>>>> request later. Anything I missed here?  Some device resources
>>>>> would not be claimed in PCI core?
>>>> Hi Yongji,
>>>>
>>>> I don't know what to say, this is not how the interface currently
>>>> works.  request_resource() is a driver level interface that tries to
>>>> prevent drivers from claiming conflicting resources.  In this patch
>>>> you're trying to use it to probe whether a resource maps to another
>>>> device.  This is not what it does.  request_resource() will happily let
>>>> you claim any resource you want, so long as nobody else claimed it
>>>> first.  So the only case where the assumptions in this patch are valid
>>>> is if we can guarantee that any potentially conflicting device has a
>>>> driver loaded that has claimed those resources.  As it is here,
>>>> vfio-pci will happily attempt to request resources that might overlap
>>>> with another device and might break drivers that haven't yet had a
>>>> chance to probe their devices.  I don't think that's acceptable.
>>>> Thanks,
>>>>
>>>> Alex
>>>>     
>>> I'm trying to get your point. Let me give an example here.
>>> I'm not sure whether my understanding is right. Please
>>> point it out if I'm wrong.
>>>
>>> We assume that there are two PCI devices like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : 0002:01:01.0
>>>         240000008000-24000000ffff : lpfc
>>>
>>> Do you mean vfio-pci driver will succeed in requesting
>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
>>> if it is loaded before lpfc driver? Like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : 0002:01:01.0
>>>         240000008000-24000000ffff : <BAD>    --> vfio-pci call
>>> request_resource()
>>>
>>> Then lpfc driver will fail when it attempts to call
>>> pci_request_regions() later.
>> Yes, that is my supposition.
>>   
>>> But is it possible that the dummy_res become the child of
>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
>>> it found parent res: PCI Bus 0002:01 already have conflict
>>> child res: 0002:01:01.0.
>>>
>>> And I think the case that request_resource() will succeed
>>> should like this:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>       240000010000-240000017fff : 0002:01:01.0
>>>
>>> There is a mem hole: [240000008000-24000000ffff] after
>>> PCI probing.  After loading drivers, the resources tree
>>> will be:
>>>
>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>     240000000000-2400ffffffff : PCI Bus 0002:01
>>>       240000000000-240000007fff : 0002:01:00.0
>>>         240000000000-240000007fff : vfio-pci
>>>       240000008000-24000000ffff : <BAD>    ---> vfio-pci call
>>> request_resource()
>>>       240000010000-240000017fff : 0002:01:01.0
>>>         240000010000-240000017fff : lpfc
>> Ok, let's stop guessing about this.  I modified your patch as follows
>> so I could easily test it on a 4k host:
>>
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>   }
>>   
>> +#define VFIO_64K_PAGE_SIZE (64*1024)
>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
>> +
>>   static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   {
>>   	struct resource *res;
>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   		if (!resource_size(res))
>>   			goto no_mmap;
>>   
>> -		if (resource_size(res) >= PAGE_SIZE) {
>> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>>   			vdev->bar_mmap_supported[bar] = true;
>>   			continue;
>>   		}
>>   
>> -		if (!(res->start & ~PAGE_MASK)) {
>> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
>> +			int ret;
>>   			/*
>>   			 * Add a dummy resource to reserve the remainder
>>   			 * of the exclusive page in case that hot-add
>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>   				goto no_mmap;
>>   
>>   			dummy_res->resource.start = res->end + 1;
>> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>>   			dummy_res->resource.flags = res->flags;
>> -			if (request_resource(res->parent,
>> -						&dummy_res->resource)) {
>> +			ret = request_resource(res->parent,
>> +						&dummy_res->resource);
>> +			if (ret) {
>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>>   				kfree(dummy_res);
>>   				goto no_mmap;
>>   			}
>>
>> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
>> system configuration to test it:
>>
>>    ee400000-ef4fffff : PCI Bus 0000:07
>>      ef480000-ef49ffff : 0000:07:00.0
>>        ef480000-ef483fff : 0000:08:10.0
>>        ef484000-ef487fff : 0000:08:10.2
>>        ef488000-ef48bfff : 0000:08:10.4
>>        ef48c000-ef48ffff : 0000:08:10.6
>>        ef490000-ef493fff : 0000:08:11.0
>>        ef494000-ef497fff : 0000:08:11.2
>>        ef498000-ef49bfff : 0000:08:11.4
>>      ef4a0000-ef4bffff : 0000:07:00.0
>>        ef4a0000-ef4a3fff : 0000:08:10.0
>>        ef4a4000-ef4a7fff : 0000:08:10.2
>>        ef4a8000-ef4abfff : 0000:08:10.4
>>        ef4ac000-ef4affff : 0000:08:10.6
>>        ef4b0000-ef4b3fff : 0000:08:11.0
>>        ef4b4000-ef4b7fff : 0000:08:11.2
>>        ef4b8000-ef4bbfff : 0000:08:11.4
>>
>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
>> separate contiguous range.  The igbvf driver is not loaded, so the
>> other resources are free to be claimed, what happens?
>>
>> It looks like you're right, the request_resource() fails with:
>>
>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>>
>> So we get back -EBUSY which means we hit a conflict.  I would have
>> expected that this means our res->parent that we're using for
>> request_resource() is only, for instance, ef480000-ef483fff (ie. the
>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
>> the parent, but adding the parent resource range to the dev_info(), it
>> actually shows the range being ef480000-ef49ffff, so the parent is
>> actually the 07:00.0 resource.  In fact, we can't even use
>> request_resource() like this to claim the BAR itself, which is why we
>> use pci_request_selected_regions(), which allows conflicts, putting the
>> requested resource at the leaf of the tree.
>>
>> So I guess I retract this concern about the use of request_resource(),
>> it does seem to behave as intended.  Unless I can spot anything else or
>> other reviewers have comments, I'll queue this into my next branch for
>> v4.8.  Thanks,
>
> Ok, one more test, I found that I have access to the following USB
> devices:
>
> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
> 	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
>
> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
> 	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
>
> These are nicely mapped such that vfio-pci can claim the residual space
> from the page, which results in the following in /proc/iomem:
>
>    f7a07000-f7a073ff : 0000:00:1d.0
>      f7a07000-f7a073ff : vfio
>    f7a07400-f7a07fff : <BAD>
>    f7a08000-f7a083ff : 0000:00:1a.0
>      f7a08000-f7a083ff : vfio
>    f7a08400-f7a08fff : <BAD>
>
> I should have looked more closely at your previous reply, I didn't
> think that "<BAD>" was literally the owner of these dummy resources.
> Please fix this to report something that isn't going frighten users
> and make small children cry.  Thanks,

Yeah, I also noticed that:-). Now I'm trying to find a proper
name. What do you think about "vfio-pci (dummy)"?

Thanks,
Yongji