Message-ID: <0f9bbdc4-deb1-5604-31d0-5ab97084fbb3@linux.ibm.com>
Date: Wed, 20 Jan 2021 12:40:05 -0500
From: Matthew Rosato <mjrosato@...ux.ibm.com>
To: Niklas Schnelle <schnelle@...ux.ibm.com>,
alex.williamson@...hat.com, cohuck@...hat.com
Cc: pmorel@...ux.ibm.com, borntraeger@...ibm.com, hca@...ux.ibm.com,
gor@...ux.ibm.com, gerald.schaefer@...ux.ibm.com,
linux-s390@...r.kernel.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region

On 1/20/21 12:28 PM, Niklas Schnelle wrote:
>
>
> On 1/20/21 6:10 PM, Matthew Rosato wrote:
>> On 1/20/21 8:21 AM, Niklas Schnelle wrote:
>>>
>>>
>>> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>>> ... snip ...
>>>> +
>>>> +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
>>>> + char __user *buf, size_t count,
>>>> + loff_t *ppos, bool iswrite)
>>>> +{
>>> ... snip ...
>>>> + /*
>>>> + * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
>>>> + * and cannot exceed a page boundary - so a single page is enough. The
>>>> + * guest should have validated this but let's double-check that the
>>>> + * request will not cross a page boundary.
>>>> + */
>>>> + if (((region->req.gaddr & ~PAGE_MASK)
>>>> + + region->req.len - 1) & PAGE_MASK) {
>>>> + return -EIO;
>>>> + }
>>>> +
>>>> + mutex_lock(&zdev->lock);
>>>
>>> I plan on using the zdev->lock for preventing concurrent zPCI device
>>> removal/configuration state changes between zPCI availability/error
>>> events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.
>>>
>>> With that use in place, using it here causes a deadlock when doing
>>> "echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
>>> attached to a guest.
>>>
>>> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
>>> sends a deconfiguration request to the guest, which then tries to
>>> gracefully shut down the device. During that shutdown the device will
>>> be accessed, running into this code path, which then blocks on
>>> the lock held by the disable_slot() code, which in turn waits on vfio
>>> releasing the device.
>>>
>>
>> Oof, good catch. The primary purpose of acquiring the zdev lock here
>> was to ensure that the region is only being used to process one
>> operation at a time; when I wrote this initially, the zdev lock seemed
>> like a reasonable candidate :)
>
> Oh ok, I thought it was to guard against removal. What's the reason
> we can't have multiple concurrent users of the region though?
> PCISTB/PCILG themselves should be executable concurrently, but
> I'm probably missing something.
>
Yes, we are protecting the vfio region data, not PCISTB/PCILG execution.
When we enter vfio_pci_zdev_io_rw() with an I/O operation (a 'write'
operation from the guest) we read in the data from the guest and place
it into vdev->region[].data. The data being read in effectively tells
us which I/O operation to perform. Now, if two threads perform a write
to this function at the same time (and furthermore, at the same or
overlapping positions), they would overwrite or collide with each
other, leaving unpredictable contents in vdev->region[].data and most
likely causing the wrong I/O operations to be performed.

It so happens that the QEMU implementation ensures this won't happen,
but the kernel interface also needs to guarantee it.
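
To make that concrete, here is a minimal, purely illustrative sketch of
the kind of serialization the region needs; the structure and names
below (vfio_zdev_io_region, io_lock, zdev_io_region_write) are
hypothetical and not the posted code:

#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * Hypothetical region state; in the posted patch the staging buffer
 * lives in vdev->region[].data.
 */
struct vfio_zdev_io_region {
	struct mutex io_lock;	/* serializes access to the staging buffer */
	void *data;		/* staging buffer for the guest's request */
	size_t size;
};

static ssize_t zdev_io_region_write(struct vfio_zdev_io_region *region,
				    const char __user *buf, size_t count,
				    loff_t pos)
{
	ssize_t ret = count;

	if (pos + count > region->size)
		return -EINVAL;

	mutex_lock(&region->io_lock);
	/*
	 * With io_lock held, a second writer cannot interleave its copy
	 * into the buffer before we act on the request staged here.
	 */
	if (copy_from_user(region->data + pos, buf, count))
		ret = -EFAULT;
	mutex_unlock(&region->io_lock);

	return ret;
}

The point is just that some lock scoped to the region (or the vdev)
would do; it does not have to be zdev->lock.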
>>
>>> Alex may correct me if I'm wrong but I think instead vfio should
>>> be holding the PCI device lock via pci_device_lock(pdev).
>>>
>>
>> OK, I can have a look at this. Thanks.
>
> Hmm, ok, that idea was based on my assumption that it guards
> against removal. My bad - of course without my patch it wouldn't,
> and this came before...
>
That would be a nice added bonus (we discussed it offline late last year
so that's probably why you're thinking this).
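
Just to sketch what that direction could look like (illustrative only,
not a vetted fix, and assuming returning -EAGAIN to userspace would be
acceptable here): from a module the readily available primitives are
device_lock()/device_trylock() on &pdev->dev, which take the same
&dev->mutex that device_release_driver() holds in the trace quoted
below, so a trylock variant would back off instead of blocking while
an unbind is in flight. The function name is a hypothetical stand-in
for the real region handler:

#include <linux/device.h>
#include <linux/pci.h>

/*
 * Sketch only: serialize the region access against driver unbind by
 * taking the struct device lock.  device_trylock() rather than
 * device_lock() so that a request racing with a (soft) hot unplug
 * returns -EAGAIN instead of blocking on the removal path.
 */
static ssize_t vfio_pci_zdev_region_op(struct pci_dev *pdev)
{
	ssize_t ret;

	if (!device_trylock(&pdev->dev))
		return -EAGAIN;

	/* ... stage the request and issue PCISTB/PCILG here ... */
	ret = 0;	/* placeholder for the real result */

	device_unlock(&pdev->dev);
	return ret;
}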
>>
>>
>>> The annotated trace with my test code looks as follows:
>>>
>>> [ 618.025091] Call Trace:
>>> [ 618.025093] [<00000007c1a139e0>] __schedule+0x360/0x960
>>> [ 618.025104] [<00000007c1a14760>] schedule_preempt_disabled+0x60/0x100
>>> [ 618.025107] [<00000007c1a16b48>] __mutex_lock+0x358/0x880
>>> [ 618.025110] [<00000007c1a170a2>] mutex_lock_nested+0x32/0x40
>>> [ 618.025112] [<000003ff805a3948>] vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
>>> [ 618.025120] [<000003ff8059b2b0>] vfio_pci_write+0xd0/0xe0 [vfio_pci]
>>> [ 618.025124] [<00000007c0fa5392>] __s390x_sys_pwrite64+0x112/0x360
>>> [ 618.025129] [<00000007c1a0aaf6>] __do_syscall+0x116/0x190
>>> [ 618.025132] [<00000007c1a1deda>] system_call+0x72/0x98
>>> [ 618.025137] 1 lock held by qemu-system-s39/1315:
>>> [ 618.025139] #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
>>> [ 618.025151]
>>> Showing all locks held in the system:
>>> [ 618.025166] 1 lock held by khungtaskd/99:
>>> [ 618.025168] #0: 00000007c1ed4748 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x210
>>> [ 618.025194] 6 locks held by zsh/1190:
>>> [ 618.025196] #0: 0000000095fc0488 (sb_writers#3){....}-{0:0}, at: __do_syscall+0x116/0x190
>>> [ 618.025226] #1: 00000000975bf090 (&of->mutex){....}-{3:3}, at: kernfs_fop_write+0x9a/0x240
>>> [ 618.025236] #2: 000000008584be78 (kn->active#245){....}-{0:0}, at: kernfs_fop_write+0xa6/0x240
>>> [ 618.025243] #3: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: disable_slot+0x32/0x130 <-------------------------------------|
>>> [ 618.025252] #4: 00000007c1f53468 (pci_rescan_remove_lock){....}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x240 |
>>> [ 618.025260] #5: 0000000085d8a1a0 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x1d0 |
>>> [ 618.025271] 1 lock held by qemu-system-s39/1312: D
>>> [ 618.025273] 1 lock held by qemu-system-s39/1313: E
>>> [ 618.025275] #0: 00000000d47e80d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm] A
>>> [ 618.025322] 1 lock held by qemu-system-s39/1314: D
>>> [ 618.025324] #0: 00000000d34700d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm] |
>>> [ 618.025345] 1 lock held by qemu-system-s39/1315: |
>>> [ 618.025347] #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci] <------------------|
>>> [ 618.025355] 1 lock held by qemu-system-s39/1317:
>>> [ 618.025357] #0: 00000000d34480d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
>>> [ 618.025378] 1 lock held by qemu-system-s39/1318:
>>> [ 618.025380] #0: 00000000d34380d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
>>> [ 618.025400] 1 lock held by qemu-system-s39/1319:
>>> [ 618.025403] #0: 00000000d47e8a90 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
>>> [ 618.025424] 2 locks held by zsh/1391:
>>> [ 618.025426] #0: 00000000d4a708a0 (&tty->ldisc_sem){....}-{0:0}, at: tty_ldisc_ref_wait+0x34/0x70
>>> [ 618.025435] #1: 0000038002fc72f0 (&ldata->atomic_read_lock){....}-{3:3}, at: n_tty_read+0xc8/0xa50
>>>
>>>
>>>> +
>>>> + ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
>>>> + if (ret <= 0) {
>>>> + count = -EIO;
>>> ... snip ...
>>>
>>