linux-kernel - Re: virtio-iommu hotplug issue

Message-ID: <c6fb5a06-aa7e-91f9-7001-f456b2769595@daynix.com>
Date:   Thu, 13 Apr 2023 20:01:54 +0900
From:   Akihiko Odaki <akihiko.odaki@...nix.com>
To:     Jean-Philippe Brucker <jean-philippe@...aro.org>
Cc:     Eric Auger <eric.auger@...hat.com>,
        virtio-dev@...ts.oasis-open.org,
        virtualization@...ts.linux-foundation.org,
        linux-kernel@...r.kernel.org, qemu-devel@...gnu.org
Subject: Re: virtio-iommu hotplug issue

On 2023/04/13 19:40, Jean-Philippe Brucker wrote:
> Hello,
> 
> On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote:
>> Hi,
>>
>> Recently I encountered a problem with the combination of Linux's
>> virtio-iommu driver and QEMU when an SR-IOV virtual function gets disabled.
>> I'd like to ask you what kind of solution is appropriate here and implement
>> the solution if possible.
>>
>> A PCIe device implementing the SR-IOV specification exports a virtual
>> function, and the guest can enable or disable it at runtime by writing to a
>> configuration register. This effectively looks like a PCI device is
>> hotplugged for the guest.
> 
> Just so I understand this better: the guest gets a whole PCIe device PF
> that implements SR-IOV, and so the guest can dynamically create VFs?  Out
> of curiosity, is that a hardware device assigned to the guest with VFIO,
> or a device emulated by QEMU?

Yes, that's right. The guest can dynamically create and delete VFs. The 
device is emulated by QEMU: igb, an Intel NIC recently added to QEMU and 
projected to be released as part of QEMU 8.0.

> 
>> In such a case, the kernel assumes the endpoint is
>> detached from the virtio-iommu domain, but QEMU actually does not detach it.
>>
>> This inconsistent view of the removed device sometimes prevents the VM from
>> correctly performing the following procedure, for example:
>> 1. Enable a VF.
>> 2. Disable the VF.
>> 3. Open a vfio container.
>> 4. Open the group which the PF belongs to.
>> 5. Add the group to the vfio container.
>> 6. Map some memory region.
>> 7. Close the group.
>> 8. Close the vfio container.
>> 9. Repeat 3-8
>>
>> When the VF gets disabled, the kernel assumes the endpoint is detached from
>> the IOMMU domain, but QEMU actually doesn't detach it. Later, the domain
>> will be reused in steps 3-8.
>>
>> In step 7, the PF will be detached, and the kernel thinks there is no
>> endpoint attached and the mapping the domain holds is cleared, but the VF
>> endpoint is still attached and the mapping is kept intact.
>>
>> In step 9, the same domain will be reused again, and the kernel requests to
>> create a new mapping, but it will conflict with the existing mapping and
>> result in -EINVAL.
>>
>> This problem can be fixed by either of:
>> - requesting the detachment of the endpoint from the guest when the PCI
>> device is unplugged (the VF is disabled)
> 
> Yes, I think this is an issue in the virtio-iommu driver, which should be
> sending a DETACH request when the VF is disabled, likely from
> viommu_release_device(). I'll work on a fix unless you would like to do it.

It would be nice if you could prepare a fix. I will test your patch with
my workload if you share it with me.

Regards,
Akihiko Odaki

> 
>> - detecting that the PCI device is gone and automatically detaching it on
>> the QEMU side.
>>
>> It is not completely clear to me which solution is more appropriate as the
>> virtio-iommu specification is written in a way independent of the endpoint
>> mechanism and does not say what should be done when a PCI device is
>> unplugged.
> 
> Yes, I'm not sure it's in scope for the specification; it's more about
> software guidance.
> 
> Thanks,
> Jean
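
The failure sequence in the quoted steps 1-9 can be illustrated with a small
toy model. This is only a sketch under stated assumptions, not QEMU or kernel
code: ToyIommuDevice, run_scenario, the domain ID 1, and the IOVA 0x1000 are
all hypothetical names and values chosen for illustration. It assumes, as the
thread describes, that the device side drops a domain's mappings only when the
last endpoint is detached, and that a MAP request for an already-mapped
address fails with -EINVAL.

```python
EINVAL = 22


class ToyIommuDevice:
    """Hypothetical, simplified stand-in for the device-side (QEMU) state."""

    def __init__(self):
        self.endpoints = {}  # domain ID -> set of attached endpoint names
        self.mappings = {}   # domain ID -> set of mapped IOVAs

    def attach(self, domain, endpoint):
        self.endpoints.setdefault(domain, set()).add(endpoint)

    def detach(self, domain, endpoint):
        attached = self.endpoints.get(domain, set())
        attached.discard(endpoint)
        if not attached:
            # Assumption: mappings are cleared only once no endpoint remains.
            self.endpoints.pop(domain, None)
            self.mappings.pop(domain, None)

    def map(self, domain, iova):
        mapped = self.mappings.setdefault(domain, set())
        if iova in mapped:
            return -EINVAL  # conflicts with a mapping the device still holds
        mapped.add(iova)
        return 0


def run_scenario(detach_on_release, rounds=2):
    """Return the MAP result of each round of steps 3-8."""
    dev = ToyIommuDevice()
    domain = 1
    dev.attach(domain, "vf")      # step 1: enable the VF
    if detach_on_release:         # step 2: disable the VF
        dev.detach(domain, "vf")  # the guest sends DETACH (proposed fix)
    # Otherwise the guest only forgets the endpoint; the device never hears.
    results = []
    for _ in range(rounds):       # steps 3-8, repeated as in step 9
        dev.attach(domain, "pf")
        results.append(dev.map(domain, 0x1000))
        dev.detach(domain, "pf")  # the guest now believes the domain is empty
    return results


print(run_scenario(detach_on_release=False))
print(run_scenario(detach_on_release=True))
```

In this model, without a DETACH on VF removal the stale VF endpoint keeps the
first round's mapping alive, so the second round's MAP fails. With the DETACH,
every round starts from an empty domain.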