Message-ID: <20250528140941.151b2f70.alex.williamson@redhat.com>
Date: Wed, 28 May 2025 14:09:41 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: David Hildenbrand <david@...hat.com>, lizhe.67@...edance.com,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, muchun.song@...ux.dev
Subject: Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for
huge folio
On Tue, 27 May 2025 20:46:27 -0300
Jason Gunthorpe <jgg@...pe.ca> wrote:
> On Tue, May 27, 2025 at 01:52:52PM -0600, Alex Williamson wrote:
>
> > > Lots of CSPs are running iommufd now. There is a commonly used OOT
> > > patch to add the insecure P2P support like VFIO. I know lots of folks
> > > have backported iommufd.. No idea about libvirt, but you can run it in
> > > compatibility mode and then you don't need to change libvirt.
> >
> > For distributions that don't have an upstream first policy, sure, they
> > can patch whatever they like. I can't recommend that solution though.
>
> I appreciate that, and we are working on it.. The first round of
> patches for DMA API improvements that Christoph asked for were sent as
> a PR yesterday.
>
> > Otherwise the problem with compatibility mode is that it's a compile
> > time choice.
>
> The compile time choice is not the compatibility mode.
>
> Any iommufd, even if opened from /dev/iommu, is usable as a VFIO
> container in the classic group based ioctls.
>
> The group path in VFIO calls vfio_group_ioctl_set_container() ->
> iommufd_ctx_from_file() which works with iommufd from any source.
>
> The type 1 emulation ioctls are also always available on any iommufd.
> After set container, VFIO does iommufd_vfio_compat_ioas_create() to
> set up the default compatibility stuff.
>
> All the compile time option does is replace /dev/vfio/vfio with
> /dev/iommu, but they have *exactly* the same fops:
>
> static struct miscdevice iommu_misc_dev = {
> 	.minor = MISC_DYNAMIC_MINOR,
> 	.name = "iommu",
> 	.fops = &iommufd_fops,
> };
>
> static struct miscdevice vfio_misc_dev = {
> 	.minor = VFIO_MINOR,
> 	.name = "vfio",
> 	.fops = &iommufd_fops,
> };
>
> So you can have libvirt open /dev/iommu, or you can have the admin
> symlink /dev/iommu to /dev/vfio/vfio and opt in on a case by case
> basis.
Yes, I'd forgotten we added this. It's QEMU that opens /dev/vfio/vfio,
and QEMU already has native iommufd support, so a per-VM hack could be
done via qemu:args or a QEMU wrapper script to instantiate an iommufd
object in the VM XML, or as noted, a system-wide change could be made
transparently via 'ln -sf /dev/iommu /dev/vfio/vfio'.
To be fair to libvirt, we'd really like libvirt to use iommufd whenever
it's available, but without feature parity that would break users, and
it's not clear how libvirt could even probe for feature parity. Things
get a lot easier for libvirt if we can switch the default at a point
where we expect no regressions.
> The compile time choice is really just a way to make testing easier
> and down the road if a distro decides they don't want to support both
> code bases, they can choose to disable the type 1 code entirely and
> still be uAPI compatible, but I think that is down the road a ways
> still.
Yep.
> > A single kernel binary cannot interchangeably provide
> > either P2P DMA with legacy vfio or better IOMMUFD improvements without
> > P2P DMA.
>
> See above, it can, and it was deliberately made easy to do without
> having to change any applications.
>
> The idea was you can sort of incrementally decide which things to move
> over. For instance you can keep all the type 1 code and vfio
> group/container stuff unchanged but use a combination of
> IOMMU_VFIO_IOAS_GET and then IOMMUFD_CMD_IOAS_MAP_FILE to map a memfd.
Right, sorry, it'd slipped my mind that we'd created the "soft"
compatibility mode too.
Zhe, so if you have no dependencies on P2P DMA within your device
assignment VMs, the options above may be useful or at least a data
point for comparison of type1 vs IOMMUFD performance. Thanks,
Alex