lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <45e20c39-6fcc-344c-1095-5fc023bacc37@linux.vnet.ibm.com>
Date:	Thu, 30 Jun 2016 11:29:59 +0800
From:	Yongji Xie <xyjxie@...ux.vnet.ibm.com>
To:	Alex Williamson <alex.williamson@...hat.com>
Cc:	kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
	bhelgaas@...gle.com, aik@...abs.ru, benh@...nel.crashing.org,
	paulus@...ba.org, mpe@...erman.id.au, warrier@...ux.vnet.ibm.com,
	zhong@...ux.vnet.ibm.com, nikunj@...ux.vnet.ibm.com,
	gwshan@...ux.vnet.ibm.com, kevin.tian@...el.com
Subject: Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio
 page is exclusive

On 2016/6/30 10:53, Alex Williamson wrote:

> On Thu, 30 Jun 2016 10:40:23 +0800
> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>
>> Hi Alex,
>>
>> On 2016/6/30 4:03, Alex Williamson wrote:
>>
>>> On Tue, 28 Jun 2016 13:47:23 -0600
>>> Alex Williamson <alex.williamson@...hat.com> wrote:
>>>   
>>>> On Tue, 28 Jun 2016 18:09:46 +0800
>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>   
>>>>> Hi, Alex
>>>>>
>>>>> On 2016/6/25 0:43, Alex Williamson wrote:
>>>>>       
>>>>>> On Fri, 24 Jun 2016 23:37:02 +0800
>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>        
>>>>>>> Hi, Alex
>>>>>>>
>>>>>>> On 2016/6/24 11:37, Alex Williamson wrote:
>>>>>>>        
>>>>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
>>>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
>>>>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
>>>>>>>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
>>>>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct resource *res;
>>>>>>>>>>> +	int bar;
>>>>>>>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
>>>>>>>>>>> +
>>>>>>>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>>>>>>>>> +
>>>>>>>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>>>>>>>>> +		res = vdev->pdev->resource + bar;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!(res->flags & IORESOURCE_MEM))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		/*
>>>>>>>>>>> +		 * The PCI core shouldn't set up a resource with a
>>>>>>>>>>> +		 * type but zero size. But there may be bugs that
>>>>>>>>>>> +		 * cause us to do that.
>>>>>>>>>>> +		 */
>>>>>>>>>>> +		if (!resource_size(res))
>>>>>>>>>>> +			goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
>>>>>>>>>>> +			vdev->bar_mmap_supported[bar] = true;
>>>>>>>>>>> +			continue;
>>>>>>>>>>> +		}
>>>>>>>>>>> +
>>>>>>>>>>> +		if (!(res->start & ~PAGE_MASK)) {
>>>>>>>>>>> +			/*
>>>>>>>>>>> +			 * Add a dummy resource to reserve the remainder
>>>>>>>>>>> +			 * of the exclusive page in case that hot-add
>>>>>>>>>>> +			 * device's bar is assigned into it.
>>>>>>>>>>> +			 */
>>>>>>>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>>>>>>>>> +			if (dummy_res == NULL)
>>>>>>>>>>> +				goto no_mmap;
>>>>>>>>>>> +
>>>>>>>>>>> +			dummy_res->resource.start = res->end + 1;
>>>>>>>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>>>>>>>>> +			dummy_res->resource.flags = res->flags;
>>>>>>>>>>> +			if (request_resource(res->parent,
>>>>>>>>>>> +						&dummy_res->resource)) {
>>>>>>>>>>> +				kfree(dummy_res);
>>>>>>>>>>> +				goto no_mmap;
>>>>>>>>>>> +			}
>>>>>>>>>> Isn't it true that request_resource() only tells us that at a given
>>>>>>>>>> point in time, no other drivers have reserved that resource?  It seems
>>>>>>>>>> like it does not guarantee that the resource isn't routed to another
>>>>>>>>>> device or that another driver won't at some point attempt to request
>>>>>>>>>> that same resource.  So for example if a user constructs their initrd
>>>>>>>>>> to bind vfio-pci to devices before other modules load, this
>>>>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
>>>>>>>>>> now failing.  The behavior will depend on driver load order and we're
>>>>>>>>>> not actually insuring that the overflow resource is unused, just that
>>>>>>>>>> we got it first.  Can we do better?  Am I missing something that
>>>>>>>>>> prevents this?  Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> Couldn't PCI resources allocator prevent this, which will find a
>>>>>>>>> empty slot in the resource tree firstly, then try to request that
>>>>>>>>> resource in allocate_resource() when a PCI device is probed.
>>>>>>>>> And I'd like to know why a PCI device driver would attempt to
>>>>>>>>> call request_resource()? Should this be done in PCI enumeration?
>>>>>>>> Hi Yongji,
>>>>>>>>
>>>>>>>> Looks like most pci drivers call pci_request_regions().  From there the
>>>>>>>> call path is:
>>>>>>>>
>>>>>>>> pci_request_selected_regions
>>>>>>>>       __pci_request_selected_regions
>>>>>>>>         __pci_request_region
>>>>>>>>           __request_mem_region
>>>>>>>>             __request_region
>>>>>>>>               __request_resource
>>>>>>>>
>>>>>>>> We see this driver ordering issue sometimes with users attempting to
>>>>>>>> blacklist native pci drivers, trying to leave a device free for use by
>>>>>>>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
>>>>>>>> driver can request device resources causing a failure when vfio-pci
>>>>>>>> tries to request those same resources.  I expect that unless it's a
>>>>>>>> boot device, like vga in my example, the resources are not enabled
>>>>>>>> until the driver opens the device, therefore the request_resource() call
>>>>>>>> doesn't occur until that point.
>>>>>>>>
>>>>>>>> For another trivial example, look at /proc/iomem as you load and unload
>>>>>>>> a driver, on my laptop with e1000e unloaded I see:
>>>>>>>>
>>>>>>>>       e1200000-e121ffff : 0000:00:19.0
>>>>>>>>       e123e000-e123efff : 0000:00:19.0
>>>>>>>>
>>>>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
>>>>>>>> driver:
>>>>>>>>
>>>>>>>>       e1200000-e121ffff : 0000:00:19.0
>>>>>>>>         e1200000-e121ffff : e1000e
>>>>>>>>       e123e000-e123efff : 0000:00:19.0
>>>>>>>>         e123e000-e123efff : e1000e
>>>>>>>>
>>>>>>>> Clearly pci core knows the resource is associated with the device, but
>>>>>>>> I don't think we're tapping into that with request_resource(), we're
>>>>>>>> just potentially stealing resources that another driver might have
>>>>>>>> claimed otherwise as I described above.  That's my suspicion at
>>>>>>>> least, feel free to show otherwise if it's incorrect.  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>           
>>>>>>> Thanks for your explanation. But I still have one question.
>>>>>>> Shouldn't PCI core have claimed all PCI device's resources
>>>>>>> after probing those devices. If so, request_resource() will fail
>>>>>>> when vfio-pci try to steal resources that another driver might
>>>>>>> request later. Anything I missed here?  Some device resources
>>>>>>> would not be claimed in PCI core?
>>>>>> Hi Yongji,
>>>>>>
>>>>>> I don't know what to say, this is not how the interface currently
>>>>>> works.  request_resource() is a driver level interface that tries to
>>>>>> prevent drivers from claiming conflicting resources.  In this patch
>>>>>> you're trying to use it to probe whether a resource maps to another
>>>>>> device.  This is not what it does.  request_resource() will happily let
>>>>>> you claim any resource you want, so long as nobody else claimed it
>>>>>> first.  So the only case where the assumptions in this patch are valid
>>>>>> is if we can guarantee that any potentially conflicting device has a
>>>>>> driver loaded that has claimed those resources.  As it is here,
>>>>>> vfio-pci will happily attempt to request resources that might overlap
>>>>>> with another device and might break drivers that haven't yet had a
>>>>>> chance to probe their devices.  I don't think that's acceptable.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>>>        
>>>>> I'm trying to get your point. Let me give an example here.
>>>>> I'm not sure whether my understanding is right. Please
>>>>> point it out if I'm wrong.
>>>>>
>>>>> We assume that there are two PCI devices like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : 0002:01:01.0
>>>>>          240000008000-24000000ffff : lpfc
>>>>>
>>>>> Do you mean vfio-pci driver will succeed in requesting
>>>>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
>>>>> if it is loaded before lpfc driver? Like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : 0002:01:01.0
>>>>>          240000008000-24000000ffff : <BAD>    --> vfio-pci call
>>>>> request_resource()
>>>>>
>>>>> Then lpfc driver will fail when it attempts to call
>>>>> pci_request_regions() later.
>>>> Yes, that is my supposition.
>>>>      
>>>>> But is it possible that the dummy_res become the child of
>>>>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
>>>>> it found parent res: PCI Bus 0002:01 already have conflict
>>>>> child res: 0002:01:01.0.
>>>>>
>>>>> And I think the case that request_resource() will succeed
>>>>> should like this:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>        240000010000-240000017fff : 0002:01:01.0
>>>>>
>>>>> There is a mem hole: [240000008000-24000000ffff] after
>>>>> PCI probing.  After loading drivers, the resources tree
>>>>> will be:
>>>>>
>>>>> 240000000000-24feffffffff : /pciex@...fe40400000
>>>>>      240000000000-2400ffffffff : PCI Bus 0002:01
>>>>>        240000000000-240000007fff : 0002:01:00.0
>>>>>          240000000000-240000007fff : vfio-pci
>>>>>        240000008000-24000000ffff : <BAD>    ---> vfio-pci call
>>>>> request_resource()
>>>>>        240000010000-240000017fff : 0002:01:01.0
>>>>>          240000010000-240000017fff : lpfc
>>>> Ok, let's stop guessing about this.  I modified your patch as follows
>>>> so I could easily test it on a 4k host:
>>>>
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>>>    	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>>>    }
>>>>    
>>>> +#define VFIO_64K_PAGE_SIZE (64*1024)
>>>> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
>>>> +
>>>>    static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    {
>>>>    	struct resource *res;
>>>> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    		if (!resource_size(res))
>>>>    			goto no_mmap;
>>>>    
>>>> -		if (resource_size(res) >= PAGE_SIZE) {
>>>> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>>>>    			vdev->bar_mmap_supported[bar] = true;
>>>>    			continue;
>>>>    		}
>>>>    
>>>> -		if (!(res->start & ~PAGE_MASK)) {
>>>> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
>>>> +			int ret;
>>>>    			/*
>>>>    			 * Add a dummy resource to reserve the remainder
>>>>    			 * of the exclusive page in case that hot-add
>>>> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>>    				goto no_mmap;
>>>>    
>>>>    			dummy_res->resource.start = res->end + 1;
>>>> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>>>>    			dummy_res->resource.flags = res->flags;
>>>> -			if (request_resource(res->parent,
>>>> -						&dummy_res->resource)) {
>>>> +			ret = request_resource(res->parent,
>>>> +						&dummy_res->resource);
>>>> +			if (ret) {
>>>> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>>>>    				kfree(dummy_res);
>>>>    				goto no_mmap;
>>>>    			}
>>>>
>>>> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
>>>> system configuration to test it:
>>>>
>>>>     ee400000-ef4fffff : PCI Bus 0000:07
>>>>       ef480000-ef49ffff : 0000:07:00.0
>>>>         ef480000-ef483fff : 0000:08:10.0
>>>>         ef484000-ef487fff : 0000:08:10.2
>>>>         ef488000-ef48bfff : 0000:08:10.4
>>>>         ef48c000-ef48ffff : 0000:08:10.6
>>>>         ef490000-ef493fff : 0000:08:11.0
>>>>         ef494000-ef497fff : 0000:08:11.2
>>>>         ef498000-ef49bfff : 0000:08:11.4
>>>>       ef4a0000-ef4bffff : 0000:07:00.0
>>>>         ef4a0000-ef4a3fff : 0000:08:10.0
>>>>         ef4a4000-ef4a7fff : 0000:08:10.2
>>>>         ef4a8000-ef4abfff : 0000:08:10.4
>>>>         ef4ac000-ef4affff : 0000:08:10.6
>>>>         ef4b0000-ef4b3fff : 0000:08:11.0
>>>>         ef4b4000-ef4b7fff : 0000:08:11.2
>>>>         ef4b8000-ef4bbfff : 0000:08:11.4
>>>>
>>>> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
>>>> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
>>>> separate contiguous range.  The igbvf driver is not loaded, so the
>>>> other resources are free to be claimed, what happens?
>>>>
>>>> It looks like you're right, the request_resource() fails with:
>>>>
>>>> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
>>>> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>>>>
>>>> So we get back -EBUSY which means we hit a conflict.  I would have
>>>> expected that this means our res->parent that we're using for
>>>> request_resource() is only, for instance, ef480000-ef483fff (ie. the
>>>> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
>>>> the parent, but adding the parent resource range to the dev_info(), it
>>>> actually shows the range being ef480000-ef49ffff, so the parent is
>>>> actually the 07:00.0 resource.  In fact, we can't even use
>>>> request_resource() like this to claim the BAR itself, which is why we
>>>> use pci_request_selected_regions(), which allows conflicts, putting the
>>>> requested resource at the leaf of the tree.
>>>>
>>>> So I guess I retract this concern about the use of request_resource(),
>>>> it does seem to behave as intended.  Unless I can spot anything else or
>>>> other reviewers have comments, I'll queue this into my next branch for
>>>> v4.8.  Thanks,
>>> Ok, one more test, I found that I have access to the following USB
>>> devices:
>>>
>>> 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
>>> 	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
>>>
>>> 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
>>> 	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
>>>
>>> These are nicely mapped such that vfio-pci can claim the residual space
>>> from the page, which results in the following in /proc/iomem:
>>>
>>>     f7a07000-f7a073ff : 0000:00:1d.0
>>>       f7a07000-f7a073ff : vfio
>>>     f7a07400-f7a07fff : <BAD>
>>>     f7a08000-f7a083ff : 0000:00:1a.0
>>>       f7a08000-f7a083ff : vfio
>>>     f7a08400-f7a08fff : <BAD>
>>>
>>> I should have looked more closely at your previous reply, I didn't
>>> think that "<BAD>" was literally the owner of these dummy resources.
>>> Please fix this to report something that isn't going frighten users
>>> and make small children cry.  Thanks,
>> Yeah, I also noticed that:-). Now I'm trying to find a proper
>> name. What do you think about "vfio-pci (dummy)"?
> How about "vfio sub-page reserved"?  Thanks,

Sounds good. I'll send a new version soon.

Thanks,
Yongji

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ