Message-ID: <20160628134723.7ffefcf3@t450s.home>
Date: Tue, 28 Jun 2016 13:47:23 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Yongji Xie <xyjxie@...ux.vnet.ibm.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
bhelgaas@...gle.com, aik@...abs.ru, benh@...nel.crashing.org,
paulus@...ba.org, mpe@...erman.id.au, warrier@...ux.vnet.ibm.com,
zhong@...ux.vnet.ibm.com, nikunj@...ux.vnet.ibm.com,
gwshan@...ux.vnet.ibm.com, kevin.tian@...el.com
Subject: Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the
mmio page is exclusive
On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
> Hi, Alex
>
> On 2016/6/25 0:43, Alex Williamson wrote:
>
> > On Fri, 24 Jun 2016 23:37:02 +0800
> > Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
> >
> >> Hi, Alex
> >>
> >> On 2016/6/24 11:37, Alex Williamson wrote:
> >>
> >>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
> >>>> On 2016/6/24 0:12, Alex Williamson wrote:
> >>>>> On Mon, 30 May 2016 21:06:37 +0800
> >>>>> Yongji Xie <xyjxie@...ux.vnet.ibm.com> wrote:
> >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>>>>> +{
> >>>>>> +        struct resource *res;
> >>>>>> +        int bar;
> >>>>>> +        struct vfio_pci_dummy_resource *dummy_res;
> >>>>>> +
> >>>>>> +        INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >>>>>> +
> >>>>>> +        for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >>>>>> +                res = vdev->pdev->resource + bar;
> >>>>>> +
> >>>>>> +                if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >>>>>> +                        goto no_mmap;
> >>>>>> +
> >>>>>> +                if (!(res->flags & IORESOURCE_MEM))
> >>>>>> +                        goto no_mmap;
> >>>>>> +
> >>>>>> +                /*
> >>>>>> +                 * The PCI core shouldn't set up a resource with a
> >>>>>> +                 * type but zero size. But there may be bugs that
> >>>>>> +                 * cause us to do that.
> >>>>>> +                 */
> >>>>>> +                if (!resource_size(res))
> >>>>>> +                        goto no_mmap;
> >>>>>> +
> >>>>>> +                if (resource_size(res) >= PAGE_SIZE) {
> >>>>>> +                        vdev->bar_mmap_supported[bar] = true;
> >>>>>> +                        continue;
> >>>>>> +                }
> >>>>>> +
> >>>>>> +                if (!(res->start & ~PAGE_MASK)) {
> >>>>>> +                        /*
> >>>>>> +                         * Add a dummy resource to reserve the remainder
> >>>>>> +                         * of the exclusive page in case that hot-add
> >>>>>> +                         * device's bar is assigned into it.
> >>>>>> +                         */
> >>>>>> +                        dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >>>>>> +                        if (dummy_res == NULL)
> >>>>>> +                                goto no_mmap;
> >>>>>> +
> >>>>>> +                        dummy_res->resource.start = res->end + 1;
> >>>>>> +                        dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >>>>>> +                        dummy_res->resource.flags = res->flags;
> >>>>>> +                        if (request_resource(res->parent,
> >>>>>> +                                        &dummy_res->resource)) {
> >>>>>> +                                kfree(dummy_res);
> >>>>>> +                                goto no_mmap;
> >>>>>> +                        }
> >>>>> Isn't it true that request_resource() only tells us that at a given
> >>>>> point in time, no other drivers have reserved that resource? It seems
> >>>>> like it does not guarantee that the resource isn't routed to another
> >>>>> device or that another driver won't at some point attempt to request
> >>>>> that same resource. So for example if a user constructs their initrd
> >>>>> to bind vfio-pci to devices before other modules load, this
> >>>>> request_resource() may succeed, at the expense of drivers loaded later
> >>>>> now failing. The behavior will depend on driver load order and we're
> >>>>> not actually ensuring that the overflow resource is unused, just that
> >>>>> we got it first. Can we do better? Am I missing something that
> >>>>> prevents this? Thanks,
> >>>>>
> >>>>> Alex
> >>>> Couldn't the PCI resource allocator prevent this? It will first find an
> >>>> empty slot in the resource tree, then try to request that
> >>>> resource in allocate_resource() when a PCI device is probed.
> >>>> And I'd like to know why a PCI device driver would attempt to
> >>>> call request_resource(). Should this be done in PCI enumeration?
> >>> Hi Yongji,
> >>>
> >>> Looks like most pci drivers call pci_request_regions(). From there the
> >>> call path is:
> >>>
> >>> pci_request_selected_regions
> >>> __pci_request_selected_regions
> >>> __pci_request_region
> >>> __request_mem_region
> >>> __request_region
> >>> __request_resource
> >>>
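For reference, here is a minimal sketch of that typical pattern (a
hypothetical "foo" driver, not one from this thread), showing where a
native driver would normally claim its BARs and therefore where it
would first collide with a range that somebody else grabbed earlier:

#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical probe: the usual point where a driver claims its BARs. */
static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int ret;

        ret = pci_enable_device(pdev);
        if (ret)
                return ret;

        /* walks down to __request_resource() for each BAR */
        ret = pci_request_regions(pdev, "foo");
        if (ret) {
                /* -EBUSY if someone already claimed one of the regions */
                pci_disable_device(pdev);
                return ret;
        }

        return 0;
}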
> >>> We see this driver ordering issue sometimes with users attempting to
> >>> blacklist native pci drivers, trying to leave a device free for use by
> >>> vfio-pci. If the device is a graphics card, the generic vesa or uefi
> >>> driver can request device resources, causing a failure when vfio-pci
> >>> tries to request those same resources. I expect that unless it's a
> >>> boot device, like vga in my example, the resources are not enabled
> >>> until the driver opens the device, so the request_resource() call
> >>> doesn't occur until that point.
> >>>
> >>> For another trivial example, look at /proc/iomem as you load and unload
> >>> a driver. On my laptop with e1000e unloaded I see:
> >>>
> >>> e1200000-e121ffff : 0000:00:19.0
> >>> e123e000-e123efff : 0000:00:19.0
> >>>
> >>> When e1000e is loaded, each of these becomes claimed by the e1000e
> >>> driver:
> >>>
> >>> e1200000-e121ffff : 0000:00:19.0
> >>>   e1200000-e121ffff : e1000e
> >>> e123e000-e123efff : 0000:00:19.0
> >>>   e123e000-e123efff : e1000e
> >>>
> >>> Clearly the PCI core knows the resource is associated with the device,
> >>> but I don't think we're tapping into that with request_resource(); we're
> >>> just potentially stealing resources that another driver might otherwise
> >>> have claimed, as I described above. That's my suspicion at least; feel
> >>> free to show otherwise if it's incorrect. Thanks,
> >>>
> >>> Alex
> >>>
> >> Thanks for your explanation. But I still have one question.
> >> Shouldn't the PCI core have claimed all PCI devices' resources
> >> after probing those devices? If so, request_resource() would fail
> >> when vfio-pci tries to steal resources that another driver might
> >> request later. Did I miss anything here? Or are some device resources
> >> not claimed by the PCI core?
> > Hi Yongji,
> >
> > I don't know what to say; this is not how the interface currently
> > works. request_resource() is a driver-level interface that tries to
> > prevent drivers from claiming conflicting resources. In this patch
> > you're trying to use it to probe whether a resource maps to another
> > device. This is not what it does. request_resource() will happily let
> > you claim any resource you want, so long as nobody else claimed it
> > first. So the only case where the assumptions in this patch are valid
> > is if we can guarantee that any potentially conflicting device has a
> > driver loaded that has claimed those resources. As it is here,
> > vfio-pci will happily attempt to request resources that might overlap
> > with another device and might break drivers that haven't yet had a
> > chance to probe their devices. I don't think that's acceptable.
> > Thanks,
> >
> > Alex
> >
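To make the distinction concrete, the check that request_resource()
boils down to is roughly the following (a simplified sketch of
__request_resource() in kernel/resource.c, not the actual code; the
real function also inserts the new node into the tree). The request is
refused only if the new range overlaps something already inserted as a
child of the chosen parent; the function only sees what is in the tree
at that moment:

#include <linux/ioport.h>

/* Simplified sketch, not the actual kernel code. */
static struct resource *conflict_sketch(struct resource *parent,
                                        struct resource *new)
{
        struct resource *tmp;

        /* walk the current children of the parent resource */
        for (tmp = parent->child; tmp; tmp = tmp->sibling) {
                if (tmp->start <= new->end && tmp->end >= new->start)
                        return tmp;     /* overlap: request_resource() gives -EBUSY */
        }
        return NULL;    /* nothing claimed here right now, so it succeeds */
}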
>
> I'm trying to understand your point. Let me give an example here.
> I'm not sure whether my understanding is right, so please
> point it out if I'm wrong.
>
> We assume that there are two PCI devices like this:
>
> 240000000000-24feffffffff : /pciex@...fe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : 0002:01:01.0
>       240000008000-24000000ffff : lpfc
>
> Do you mean the vfio-pci driver will succeed in requesting the
> dummy_res [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> if it is loaded before the lpfc driver? Like this:
>
> 240000000000-24feffffffff : /pciex@...fe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : 0002:01:01.0
>       240000008000-24000000ffff : <BAD>  --> vfio-pci calls request_resource()
>
> Then the lpfc driver will fail when it attempts to call
> pci_request_regions() later.
Yes, that is my supposition.
> But is it possible that the dummy_res becomes a child of
> the res 0002:01:01.0? Wouldn't request_resource() fail when
> it finds that the parent res PCI Bus 0002:01 already has a
> conflicting child res 0002:01:01.0?
>
> And I think the case where request_resource() will succeed
> should look like this:
>
> 240000000000-24feffffffff : /pciex@...fe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>     240000010000-240000017fff : 0002:01:01.0
>
> There is a mem hole [240000008000-24000000ffff] after
> PCI probing. After loading the drivers, the resource tree
> will be:
>
> 240000000000-24feffffffff : /pciex@...fe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : <BAD>  --> vfio-pci calls request_resource()
>     240000010000-240000017fff : 0002:01:01.0
>       240000010000-240000017fff : lpfc
Ok, let's stop guessing about this. I modified your patch as follows
so I could easily test it on a 4k host:
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
         return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
+#define VFIO_64K_PAGE_SIZE (64*1024)
+#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
         struct resource *res;
@@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
                 if (!resource_size(res))
                         goto no_mmap;
-                if (resource_size(res) >= PAGE_SIZE) {
+                if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
                         vdev->bar_mmap_supported[bar] = true;
                         continue;
                 }
-                if (!(res->start & ~PAGE_MASK)) {
+                if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
+                        int ret;
                         /*
                          * Add a dummy resource to reserve the remainder
                          * of the exclusive page in case that hot-add
@@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
                                 goto no_mmap;
                         dummy_res->resource.start = res->end + 1;
-                        dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+                        dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
                         dummy_res->resource.flags = res->flags;
-                        if (request_resource(res->parent,
-                                        &dummy_res->resource)) {
+                        ret = request_resource(res->parent,
+                                        &dummy_res->resource);
+                        if (ret) {
+dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
                                 kfree(dummy_res);
                                 goto no_mmap;
                         }
IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I found a
system configuration to test it:
ee400000-ef4fffff : PCI Bus 0000:07
  ef480000-ef49ffff : 0000:07:00.0
    ef480000-ef483fff : 0000:08:10.0
    ef484000-ef487fff : 0000:08:10.2
    ef488000-ef48bfff : 0000:08:10.4
    ef48c000-ef48ffff : 0000:08:10.6
    ef490000-ef493fff : 0000:08:11.0
    ef494000-ef497fff : 0000:08:11.2
    ef498000-ef49bfff : 0000:08:11.4
  ef4a0000-ef4bffff : 0000:07:00.0
    ef4a0000-ef4a3fff : 0000:08:10.0
    ef4a4000-ef4a7fff : 0000:08:10.2
    ef4a8000-ef4abfff : 0000:08:10.4
    ef4ac000-ef4affff : 0000:08:10.6
    ef4b0000-ef4b3fff : 0000:08:11.0
    ef4b4000-ef4b7fff : 0000:08:11.2
    ef4b8000-ef4bbfff : 0000:08:11.4
This is an 82576 NIC where each VF has two 16k BARs (0 & 3); all
the VF BAR0s are in one contiguous range and all the VF BAR3s are in a
separate contiguous range. The igbvf driver is not loaded, so the
other resources are free to be claimed. What happens?
It looks like you're right; the request_resource() calls fail with:
vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
So we get back -EBUSY, which means we hit a conflict. I would have
expected this to mean that the res->parent we're using for
request_resource() is only, for instance, ef480000-ef483fff (i.e. the
BAR itself), and thus that our request for ef484000-ef48ffff exceeds
the end of the parent. But adding the parent resource range to the
dev_info() shows the range is actually ef480000-ef49ffff, so the parent
is actually the 07:00.0 resource. In fact, we can't even use
request_resource() like this to claim the BAR itself, which is why we
use pci_request_selected_regions(), which allows conflicts, putting the
requested resource at the leaf of the tree.
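Working through the first failure with the numbers above makes the
conflict clear: BAR3 of 0000:08:10.0 is ef4a0000-ef4a3fff, so with the
forced 64k page size the dummy resource is [res->end + 1, res->start +
64k - 1] = ef4a4000-ef4affff, which overlaps the 0000:08:10.2,
0000:08:10.4 and 0000:08:10.6 ranges that the PCI core already inserted
under the 07:00.0 resource, hence the -EBUSY.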
So I guess I retract this concern about the use of request_resource();
it does seem to behave as intended. Unless I can spot anything else or
other reviewers have comments, I'll queue this into my next branch for
v4.8. Thanks,
Alex