Message-ID: <cover.1760487869.git.nicolinc@nvidia.com>
Date: Tue, 14 Oct 2025 17:29:32 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: <jgg@...dia.com>, <kevin.tian@...el.com>
CC: <robin.murphy@....com>, <joro@...tes.org>, <will@...nel.org>,
	<iommu@...ts.linux.dev>, <linux-kernel@...r.kernel.org>, <shuah@...nel.org>,
	<linux-kselftest@...r.kernel.org>, <shyamsaini@...ux.microsoft.com>
Subject: [PATCH v2 0/7] iommufd: Add MSI mapping support with nested SMMU (Part-2 RMR)

[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When the
IOMMU is disabled, the MSI address is programmed to the physical location
of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
page is behind the IOMMU, so the MSI address is programmed to an allocated
IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be mapped to
the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, the IOVA is still used to program
the MSI address, though the mapping is split into two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
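
As an illustration only, the two-stage walk above can be modeled in a few
lines of C. This is a minimal sketch, not kernel code: struct mapping,
translate() and translate_nested() are hypothetical names, the addresses
are the example values from the text, and the 64KiB region size is an
assumption made purely for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous mapping: [in_base, in_base + size) -> out_base. */
struct mapping {
	uint64_t in_base;
	uint64_t out_base;
	uint64_t size;
};

/* Translate addr through a single stage; returns false on a miss. */
static bool translate(const struct mapping *tbl, size_t n,
		      uint64_t addr, uint64_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= tbl[i].in_base &&
		    addr - tbl[i].in_base < tbl[i].size) {
			*out = tbl[i].out_base + (addr - tbl[i].in_base);
			return true;
		}
	}
	return false;
}

/* Stage 1: IOVA -> IPA. Stage 2: IPA -> PA. Sizes are assumed. */
static const struct mapping stage1[] = {
	{ 0xFFFF0000, 0x80900000, 0x10000 },
};
static const struct mapping stage2[] = {
	{ 0x80900000, 0x20200000, 0x10000 },
};

/* Walk both stages in order; a miss in either stage is a fault. */
static bool translate_nested(uint64_t iova, uint64_t *pa)
{
	uint64_t ipa;

	if (!translate(stage1, 1, iova, &ipa))
		return false;
	return translate(stage2, 1, ipa, pa);
}
```

With these tables, translate_nested(0xFFFF0000, &pa) resolves to the ITS
page at 0x20200000, while an IOVA with no stage-1 entry fails before
stage 2 is ever consulted.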

If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.

So far, this IOMMU_RESV_SW_MSI works well as long as the kernel is entirely
in charge of the IOMMU translation (1-stage translation), since the IOVA
for the ITS page is fixed and known to the kernel. However, when a virtual
machine enables a nested IOMMU translation (2-stage), the guest kernel
directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping
a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g.
0xEEEE0000). The host kernel then has no way to know the guest-level IOVA
that must be programmed as the MSI address.

There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few
   RMRs (Reserved Memory Regions) in the guest's IORT. The guest kernel
   would then fetch these RMR entries from the IORT and create an
   IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
   Eventually, the mappings would look like:
   IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
   This requires an IOMMUFD ioctl for the kernel and the VMM to agree on
   the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
   driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
   page location (IPA) to the kernel for the stage-2 mapping. Eventually:
   IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).

It is worth mentioning that when Eric Auger was working on the same topic
with the VFIO iommu uAPI, he started with approach (2) and then switched
to approach (1), as suggested by Jean-Philippe, to reduce complexity.

Approach (1) closely resembles the existing VFIO passthrough, which has a
1-stage mapping for the unmanaged domain; the only difference is that the
MSI mapping shifts from stage 1 (guest-has-no-iommu case) to stage 2
(guest-has-iommu case). So, it can reuse the existing IOMMU_RESV_SW_MSI
piece, sharing the same idea of "VMM leaving everything to the kernel".

Approach (2) is the ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVA(s) and stage-2 IPA(s) for the
vITS page(s), which demands close cooperation from the VMM.
 * It also brings some complicated use cases to the table, where the host
   and/or guest system(s) have multiple ITS pages.

[ Execution ]
The iommu core rework (part-1) for iommufd_sw_msi is merged. So, now
IOMMU_RESV_SW_MSI can be used as an ABI. The VMM can take this hard-coded
MSI window and create a direct stage-1 mapping using RMR in the guest's
IORT. However, a proper uAPI must be defined for the kernel and the VMM to
agree on this virtual MSI window.

Moreover, some use cases might want to map the IOVAs in IOMMU_RESV_SW_MSI
for something else. This requires the kernel to provide an interface to
shift the software MSI window to a different region:
https://lore.kernel.org/all/20250909154600.910110-1-shyamsaini@linux.microsoft.com/

This series, as a follow-up series, introduces a pair of iommufd options
for user space to configure the software MSI window.
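
To make the kind of sanity checking that a user-configurable window needs
more concrete, here is a minimal user-space sketch. It is not taken from
the patches: struct sw_msi_window and both helpers are hypothetical names,
and treating a zero-sized window as invalid is an assumption of this
example rather than a statement about the series' semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a software MSI window (start and size). */
struct sw_msi_window {
	uint64_t start;
	uint64_t size;
};

/*
 * Valid if the window is non-empty (an assumption of this sketch) and
 * its last byte, start + size - 1, does not wrap past UINT64_MAX.
 */
static bool sw_msi_window_valid(const struct sw_msi_window *w)
{
	if (w->size == 0)
		return false;
	return w->size - 1 <= UINT64_MAX - w->start;
}

/* True if two windows share any address; both must already be valid. */
static bool sw_msi_window_overlaps(const struct sw_msi_window *a,
				   const struct sw_msi_window *b)
{
	uint64_t a_last = a->start + a->size - 1;
	uint64_t b_last = b->start + b->size - 1;

	return a->start <= b_last && b->start <= a_last;
}
```

The overflow test is written as a comparison against UINT64_MAX - start
instead of computing start + size directly, so the check itself cannot
wrap. An overlap helper of this shape could be used to reject a window
that collides with an IOVA range user space already wants for something
else.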

[ Future Plan ]
Part-3 and beyond will continue the effort of supporting the approach (2)
for a complete vITS-to-pITS mapping:
 1) Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
 2) Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)
    (Note that the set_option uAPI in this series might not fit, since
    this requires an array of MSI IOVAs.)

This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr

Changelog
v2
 * Rebase on v6.18-rc1
 * Update commit logs and kdocs
 * Add a patch fixing iommufd_device_is_attached()
 * Add sanity check for overflow and cover it in the selftest
v1 (containing part-1 that is now merged)
 https://lore.kernel.org/all/cover.1739005085.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Nicolin Chen (7):
  iommufd/device: Move sw_msi_start from igroup to idev
  iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions
  iommufd/device: Make iommufd_device_is_attached non-static
  iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
  iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH
  iommufd/selftest: Add a testing reserved region
  iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE

 drivers/iommu/iommufd/iommufd_private.h       |   7 +-
 drivers/iommu/iommufd/iommufd_test.h          |   4 +
 include/uapi/linux/iommufd.h                  |  21 +++-
 drivers/iommu/iommufd/device.c                |  43 +++----
 drivers/iommu/iommufd/driver.c                |   4 +-
 drivers/iommu/iommufd/io_pagetable.c          |  18 ++-
 drivers/iommu/iommufd/ioas.c                  | 113 ++++++++++++++++++
 drivers/iommu/iommufd/main.c                  |   4 +
 drivers/iommu/iommufd/selftest.c              |  35 +++++-
 tools/testing/selftests/iommu/iommufd.c       | 105 ++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  21 ++++
 11 files changed, 339 insertions(+), 36 deletions(-)

-- 
2.43.0

