linux-kernel - Re: [PATCH v4 1/2] iommu/s390: Fix race with release

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <62637e13-7680-28ff-d395-10e6232fd3e6@linux.ibm.com>
Date:   Fri, 2 Sep 2022 14:20:56 -0400
From:   Matthew Rosato <mjrosato@...ux.ibm.com>
To:     Jason Gunthorpe <jgg@...dia.com>
Cc:     Robin Murphy <robin.murphy@....com>, iommu@...ts.linux.dev,
        Alex Williamson <alex.williamson@...hat.com>,
        linux-s390@...r.kernel.org, schnelle@...ux.ibm.com,
        pmorel@...ux.ibm.com, borntraeger@...ux.ibm.com, hca@...ux.ibm.com,
        gor@...ux.ibm.com, gerald.schaefer@...ux.ibm.com,
        agordeev@...ux.ibm.com, svens@...ux.ibm.com, joro@...tes.org,
        will@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 1/2] iommu/s390: Fix race with release_device ops

On 9/2/22 1:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 02, 2022 at 01:11:09PM -0400, Matthew Rosato wrote:
>> On 9/1/22 4:37 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 01, 2022 at 12:14:24PM -0400, Matthew Rosato wrote:
>>>> On 9/1/22 6:25 AM, Robin Murphy wrote:
>>>>> On 2022-08-31 21:12, Matthew Rosato wrote:
>>>>>> With commit fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev
>>>>>> calls") s390-iommu is supposed to handle dynamic switching between IOMMU
>>>>>> domains and the DMA API handling.  However, this commit does not
>>>>>> sufficiently handle the case where the device is released via a call
>>>>>> to the release_device op as it may occur at the same time as an opposing
>>>>>> attach_dev or detach_dev since the group mutex is not held over
>>>>>> release_device.  This was observed if the device is deconfigured during a
>>>>>> small window during vfio-pci initialization and can result in WARNs and
>>>>>> potential kernel panics.
>>>>>
>>>>> Hmm, the more I think about it, something doesn't sit right about this whole situation... release_device is called via the notifier from device_del() after the device has been removed from its parent bus and largely dismantled; it should definitely not still have a driver bound by that point, so how is VFIO doing things that manage to race at all?
>>>>>
>>>>> Robin.
>>>>
>>>> So, I generally have seen the issue manifest as one of the calls
>>>> into the iommu core from __vfio_group_unset_container
>>>> (e.g. iommu_deatch_group via vfio_type1_iommu) failing with a WARN.
>>>> This happens when the vfio group fd is released, which could be
>>>> coming e.g. from a userspace ioctl VFIO_GROUP_UNSET_CONTAINER.
>>>> AFAICT there's nothing serializing the notion of calling into the
>>>> iommu core here against a device that is simultaneously going
>>>> through release_device (because we don't enter release_device with
>>>> the group mutex held), resulting in unpredictable behavior between
>>>> the dueling attach_dev/detach_dev and the release_device for
>>>> s390-iommu at least.
>>>
>>> Oh, this is a vfio bug.
>>
>> I've been running with your diff applied today on s390 and this
>> indeed fixes the issue by preventing the detach-after-release coming
>> out of vfio. 
> 
> Heh, I'm shocked it worked at all
> 
> I've been trying to understand Robin's latest remarks because maybe I
> don't really understand your situation right.
> 
> IMHO this is definately a VFIO bug, because in a single-device group
> we must not allow the domain to remain attached past remove(). Or more
> broadly we shouldn't be holding ownership of a group without also
> having a driver attached.> 
> But this dicussion with Robin about multi-device groups and hotplug
> makes me wonder what your situation is? There is certainly something
> interesting there too, and this can't be a solution to that problem.
> 

It's just a single-device group, that's all we do in s390 today.  Of course, as we move forward to consume s390-iommu for other use-cases (e.g. Niklas DMA conversion) then yeah I think we will still need something within s390-iommu to handle competing e.g. attach/detach and release_device calls.

For this particular issue, the scenario is triggered by deconfiguring that one host device (effectively, pull the plug) while the device is passed to QEMU via vfio-pci.  

>> Can you send as a patch for review?
> 
> After I wrote this I had a better idea, to avoid the completion and
> just fully orphan the group fd.
> 
> And the patch is kind of messy
> 
> Can you forward me the backtrace you hit also?

It manifests in a few different ways (depends on the timing of the iommu_detach_group vs release_device), but this is by far the most common:

[  402.718355] iommu driver failed to attach the default/blocking domain
[  402.718380] WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
[  402.718389] Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
[  402.718439] CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
[  402.718442] Hardware name: IBM 3931 A01 782 (LPAR)
[  402.718443] Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
[  402.718447]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  402.718450] Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
[  402.718453]            00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
[  402.718455]            00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
[  402.718457]            00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
[  402.718464] Krnl Code: 000000095bb10d18: c020003d56fc	larl	%r2,000000095c2bbb10
                          000000095bb10d1e: c0e50019d901	brasl	%r14,000000095be4bf20
                         #000000095bb10d24: af000000		mc	0,0
                         >000000095bb10d28: b904002a		lgr	%r2,%r10
                          000000095bb10d2c: ebaff0a00004	lmg	%r10,%r15,160(%r15)
                          000000095bb10d32: c0f4001aa867	brcl	15,000000095be65e00
                          000000095bb10d38: c004002168e0	brcl	0,000000095bf3def8
                          000000095bb10d3e: eb6ff0480024	stmg	%r6,%r15,72(%r15)
[  402.718532] Call Trace:
[  402.718534]  [<000000095bb10d28>] iommu_detach_group+0x70/0x80 
[  402.718538] ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
[  402.718540]  [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1] 
[  402.718546]  [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio] 
[  402.718552]  [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio] 
[  402.718557] pci 0004:00:00.0: Removing from iommu group 4
[  402.718555]  [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100 
[  402.718562]  [<000000095be5d3b4>] __do_syscall+0x1d4/0x200 
[  402.718566]  [<000000095be6c072>] system_call+0x82/0xb0 
[  402.718570] Last Breaking-Event-Address:
[  402.718571]  [<000000095be4bf80>] __warn_printk+0x60/0x68


The WARN is hit because the attach fails due to the simultaneously in-flight release_device.

The subject patch was basically attempting to serialize the value that was being stomped over within s390-iommu (zdev->s390_domain) as a result of the competing iommu_detach_group + release_device.  By ensuring the vfio group is cleaned up before the release, I no longer see the competing threads, with release_device happening after the group is cleaned up. 

> 
> (Though I'm not sure I can get to this promptly, I have only 4 working
> days before LPC and still many things to do)
> 
> Thanks,
> Jason