Message-ID: <20250918214425.2677057-1-amastro@fb.com>
Date: Thu, 18 Sep 2025 14:44:07 -0700
From: Alex Mastro <amastro@...com>
To: Alex Williamson <alex.williamson@...hat.com>,
    Jason Gunthorpe <jgg@...pe.ca>,
    Kevin Tian <kevin.tian@...el.com>
CC: Bjorn Helgaas <bhelgaas@...gle.com>, David Reiss <dreiss@...a.com>,
    Joerg Roedel <joro@...tes.org>, Keith Busch <kbusch@...nel.org>,
    Leon Romanovsky <leon@...nel.org>, Li Zhe <lizhe.67@...edance.com>,
    Mahmoud Adam <mngyadam@...zon.de>, Philipp Stanner <pstanner@...hat.com>,
    Robin Murphy <robin.murphy@....com>,
    Vivek Kasireddy <vivek.kasireddy@...el.com>,
    Will Deacon <will@...nel.org>, Yunxiang Li <Yunxiang.Li@....com>,
    <linux-kernel@...r.kernel.org>, <iommu@...ts.linux.dev>,
    <kvm@...r.kernel.org>
Subject: [TECH TOPIC] vfio, iommufd: Enabling user space drivers to vend more granular access to client processes
Hello,

We've been running user space drivers (USD) in production built on top of VFIO,
and have come to value the operational and development benefits of being able to
deploy updates to device policy by simply shipping user space binaries.
In our architecture, a long-running USD process bootstraps and manages
background device operations related to supporting client process workloads.
Client processes communicate with the USD over Unix domain sockets to acquire
the device resources necessary to dispatch work to the device.
We anticipate a growing need to provide more granular access to device
resources beyond what the kernel currently affords to user space drivers that
follow a model like ours.
The purpose of this email is to:
- Gauge the extent to which ongoing work in VFIO and IOMMUFD can meet those
needs.
- Seek guidance from VFIO and IOMMUFD maintainers about whether there is a path
to supporting the remaining pieces across the VFIO and IOMMUFD UAPIs.
- Describe our current approach and get feedback on whether there are existing
solutions that we've missed.
- Figure out the most useful places we can help contribute.
Inter-process communication latency between client processes and the USD is
prohibitively high for hot-path communication between the client and the
device, which targets round-trip times on the order of microseconds. To address
this, we need to allow client processes and the device to access each other's
memory directly, bypassing IPC with the USD and kernel syscalls.
a) For host-initiated access into device memory, this means mmap-ing BAR
sub-regions into the client process.
b) For device-initiated access into host memory, it means establishing IOMMU
mappings to memory underlying the client process address space.
Such things are more straightforward for in-kernel device drivers to accomplish:
they are free to define customized semantics for their associated fds and
syscall handlers. Today, user space driver processes have fewer tools at their
disposal for controlling these types of access.
----------
BAR Access
----------
To achieve (a), the USD sends the VFIO device fd to the client over Unix domain
sockets using SCM_RIGHTS, along with descriptions of which device regions are
for what. While this allows the client to mmap BARs into its address space,
it comes at the cost of exposing more access to device BAR regions than is
necessary or appropriate. In our use case, we don't need to contend with
adversarial client processes, so the current situation is tenable, but not
ideal.
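
For reference, the fd-passing half of this is roughly the following minimal
sketch (not our actual code): socket setup is assumed, error handling is
elided, and send_device_fd is just an illustrative helper name.

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send the VFIO device fd to a connected AF_UNIX client via SCM_RIGHTS. */
  static int send_device_fd(int sock, int device_fd)
  {
      char tag = 0;  /* stream sockets need at least one byte of payload */
      struct iovec iov = { .iov_base = &tag, .iov_len = sizeof(tag) };
      union {
          char buf[CMSG_SPACE(sizeof(int))];
          struct cmsghdr align;
      } u = { .buf = { 0 } };
      struct msghdr msg = {
          .msg_iov = &iov,
          .msg_iovlen = 1,
          .msg_control = u.buf,
          .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &device_fd, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }
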
Ongoing efforts to add dma-buf exporting to VFIO [1] seem relevant here. Though
the series' current focus is enabling peer-to-peer access, the fact that only a
subset of device regions is bound to the dma-buf fd could be useful for
controlling access granularity to device regions.
Instead of vending the VFIO device fd to the client process, the USD could bind
the necessary BAR regions to a dma-buf fd and share that with the client. If
VFIO supported dma_buf_ops.mmap, the client could mmap those into its address
space.
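
A client-side sketch of what that would enable, assuming the USD has already
communicated the dma-buf fd and its length over the control socket; map_dmabuf
is an illustrative name, and this only works if the exporter implements
dma_buf_ops.mmap, which VFIO does not today.

  #include <stddef.h>
  #include <sys/mman.h>

  /* Map a received dma-buf fd (e.g. covering a doorbell window of a BAR). */
  static void *map_dmabuf(int dmabuf_fd, size_t len)
  {
      void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      dmabuf_fd, 0);

      return va == MAP_FAILED ? NULL : va;
  }
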
Adding such a capability would mean there are two paths for mmap-ing device
regions: the VFIO device fd and the dma-buf fd. I imagine this could be
contentious; people may not like that there are two ways to achieve the same
thing. It also seems complicated by the fact that there are ongoing discussions
about how to extend the VFIO device fd UAPI to support features like write
combining [2]. It would feel incomplete for such features to be available
through one mmap-ing modality but not the other. This would have implications
for how the “special regions” should be communicated across the UAPI.
The VFIO dma-buf UAPI currently being proposed [3] takes a region_index and an
array of (offset, length) intervals within the region to assign to the dma-buf.
From what I can tell, that seems coherent with the latest direction from [2],
which would enable the creation of new region indices with special properties
that alias the default BAR regions. The USD could theoretically create
a dma-buf backed by "the region index corresponding to write-combined BAR 4" to
share with the client.
Given some of the considerations above, would there be line of sight for adding
support for dma_buf_ops.mmap to VFIO?
-------------
IOMMU Mapping
-------------
To achieve (b), we have been using the (now legacy) VFIO container interface
to manage access to the IOMMU. We understand that new feature development has
moved to IOMMUFD, and intend to migrate to using it when it's ready (we have
some use cases that require P2P). We are not expecting to add features to VFIO
containers. I will describe what we are doing today first.
In order to enable a device to access memory in multiple processes, we also
share the VFIO container fd using SCM_RIGHTS between the USD and client
processes. In this scheme, we partition the I/O address space (IOAS) of a
given device's container so it can be used cooperatively among the processes.
The partitioning convention is enforced only by each process restricting its
VFIO_IOMMU_{MAP,UNMAP}_DMA calls to the IOVA ranges assigned to it.
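
From the client's side, the convention amounts to something like the sketch
below, where assigned_iova falls inside the window the USD handed out
(illustrative names, error handling elided).

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Map a client buffer at an IOVA inside the window assigned by the USD. */
  static int map_into_assigned_window(int container_fd, void *buf, size_t len,
                                      uint64_t assigned_iova)
  {
      struct vfio_iommu_type1_dma_map map = {
          .argsz = sizeof(map),
          .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
          .vaddr = (uintptr_t)buf,
          .iova = assigned_iova,
          .size = len,
      };

      return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }
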
When the USD detects that a client process has exited, it unmaps any leftover
mappings with VFIO_IOMMU_UNMAP_DMA. This became possible after [4], which
allowed one process to free mappings created by another process.
That patch's intent was to enable QEMU live update use cases, but benefited our
use case as well.
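
The USD-side cleanup is roughly the following sketch, assuming the client's
mappings are fully contained within its assigned window (illustrative names,
error handling elided).

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Reclaim a departed client's entire IOVA window in one unmap call. */
  static int reclaim_client_window(int container_fd, uint64_t window_iova,
                                   uint64_t window_size)
  {
      struct vfio_iommu_type1_dma_unmap unmap = {
          .argsz = sizeof(unmap),
          .iova = window_iova,
          .size = window_size,
      };

      return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
  }
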
Again, we don't have to contend with adversarial client processes, so this has
been workable for now, but it is not ideal.
We are interested in the following incremental capabilities:
- We want the USD to be able to create and vend fds which grant a client
  restricted mapping access to the device's IOAS, while preserving the USD's
  ability to revoke device access to client memory via VFIO_IOMMU_UNMAP_DMA
  (or IOMMUFD_CMD_IOAS_UNMAP for IOMMUFD), or alternatively to forcefully
  invalidate the entire restricted IOMMU fd, including its mappings (see the
  sketch after this list).
- It would be nice if mappings created through the restricted IOMMU fd were
  automatically freed when the underlying kernel object is freed, e.g. if the
  client process exits ungracefully without explicitly unmapping after itself.
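
For concreteness, the revocation half of the first item maps onto today's
IOMMUFD UAPI roughly as in the sketch below; the restricted fd that would
constrain what the client can map in the first place is the part with no
existing counterpart (iommufd/IOAS setup assumed, illustrative names, error
handling elided).

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* USD revokes device access to a client's entire IOVA window. */
  static int revoke_client_window(int iommufd, uint32_t ioas_id,
                                  uint64_t window_iova, uint64_t window_len)
  {
      struct iommu_ioas_unmap unmap = {
          .size = sizeof(unmap),
          .ioas_id = ioas_id,
          .iova = window_iova,
          .length = window_len,
      };

      return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
  }
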
Some of those things sound very similar to the direction of vIOMMU, but it is
difficult to tell if that could meet our needs exactly. The kinds of features
I think we want should be achievable purely in software without any dedicated
hardware support.
This is an area we are less familiar with, since we haven't been living on the
IOMMUFD UAPI or following its development as closely yet. Perhaps we have missed
something more obvious?
Overall, I'm curious to hear feedback on this. Allowing user space drivers
to vend more granular device access would certainly benefit our use case, and
perhaps others as well.
[1] https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org/
[2] https://lore.kernel.org/all/20250804104012.87915-1-mngyadam@amazon.de/
[3] https://lore.kernel.org/all/5e043d8b95627441db6156e7f15e6e1658e9d537.1754311439.git.leon@kernel.org/
[4] https://lore.kernel.org/all/20220627035109.73745-1-lizhe.67@bytedance.com/
Thanks,
Alex