Message-ID: <20250922235651.GG1391379@nvidia.com>
Date: Mon, 22 Sep 2025 20:56:51 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Andrew Jones <ajones@...tanamicro.com>
Cc: iommu@...ts.linux.dev, kvm-riscv@...ts.infradead.org,
	kvm@...r.kernel.org, linux-riscv@...ts.infradead.org,
	linux-kernel@...r.kernel.org, zong.li@...ive.com,
	tjeznach@...osinc.com, joro@...tes.org, will@...nel.org,
	robin.murphy@....com, anup@...infault.org, atish.patra@...ux.dev,
	tglx@...utronix.de, alex.williamson@...hat.com,
	paul.walmsley@...ive.com, palmer@...belt.com, alex@...ti.fr
Subject: Re: [RFC PATCH v2 08/18] iommu/riscv: Use MSI table to enable IMSIC
 access

On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
> On Mon, Sep 22, 2025 at 03:43:36PM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> > > When setting irq affinity extract the IMSIC address the device
> > > needs to access and add it to the MSI table. If the device no
> > > longer needs access to an IMSIC then remove it from the table
> > > to prohibit access. This allows isolating device MSIs to a set
> > > of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> > > domain flag.
> > 
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.
> > 
> >  * Isolated MSI means that HW modeled by an irq_domain on the path from the
> >  * initiating device to the CPU will validate that the MSI message specifies an
> >  * interrupt number that the device is authorized to trigger. This must block
> >  * devices from triggering interrupts they are not authorized to trigger.
> >  * Currently authorization means the MSI vector is one assigned to the device.
> 
> Unfortunately the RISC-V IOMMU doesn't have support for this. I've raised
> the lack of MSI data validation to the spec writers and I'll try to raise
> it again, but I was hoping we could still get IRQ_DOMAIN_FLAG_ISOLATED_MSI
> by simply ensuring the MSI addresses only include the affined harts (and
> also with the NOTE comment I've put in this patch to point out the
> deficiency).

Nope, sorry, it has to validate the incoming target Linux interrupt
vector in some way - that is the whole point and the required
security.

We cannot allow devices assigned to one VFIO user to trigger vectors
that belong to devices outside that VFIO user. HARTs should have
nothing to do with that security statement.

> > It has to do with each PCI BDF having a unique set of
> > validation/mapping tables for MSIs that are granular to the interrupt
> > number.
> 
> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.

The interrupt number is a Linux concept: HW decodes the addr/data pair
and delivers it to some Linux interrupt. Linux doesn't care how the HW
treats the addr/data pair; it can ignore the data if it wants.
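
For reference, this is the addr/data pair as Linux models it,
simplified from struct msi_msg in include/linux/msi.h (the real struct
wraps each field in an arch-specific union):

  struct msi_msg {
          u32     address_lo;     /* low 32 bits of the target address */
          u32     address_hi;     /* high 32 bits of the target address */
          u32     data;           /* opaque payload the HW decodes */
  };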

> 1. stage1 not BARE
> ------------------
> 
>       stage1     MSI table
>  IOVA ------> A  ---------> host-MSI-address

If you run the driver like this then it should use IOMMU_RESV_SW_MSI
to make an aperture. Follow how the ARM GIC works for this. The Linux
architecture is that the iommu_domain owner is responsible for
providing a range for SW_MSI and connecting it to the interrupt driver.

> 2. stage1 is BARE, for example if only stage2 is in use
> -------------------------------------------------------
> 
>            MSI table
>  IOVA == A ---------> host-MSI-address

And here you'd use HW_MSI to set a fixed aperture that matches
DC.msi_addr_* (?)

> I don't think we can apply a lot of AMD's and Intel's model to RISC-V.

AMD and Intel have the interrupt translation as part of the "DC". That
corresponds to the DC.msiptp related stuff.

ARM has a SW_MSI that applies the S1/S2 translation first and then
puts the interrupt translation in the GIC IP block. That corresponds
to what you just explained above.

Somehow RISC-V has the most challenging properties of both
architectures, and misunderstood what security Linux expects out of
interrupt remapping :|

Use SW_MSI to open the aperture like ARM does, and use AMD/Intel-style
techniques to manage the translation table in the shared DC.

Don't touch the iommu_domain from the interrupt driver; it simply
cannot work.
 
> > It looks like there is something in here to support HW that doesn't
> > have msiptp? That's different, and also looks very confused.
> 
> The only support is to ensure all the host IMSICs are mapped, otherwise
> we can't turn on IOMMU_DMA since all MSI writes will cause faults. We
> don't set IRQ_DOMAIN_FLAG_ISOLATED_MSI in this case, though, since we
> don't bother unmapping MSI addresses of harts that IRQs have been
> un-affined from.

OK, that seems like a good first series. Implement just the SW_MSI to
expose all the IMSICs.

> > Instead it probably has to use the SW_MSI mechanism to request mapping
> > the interrupt controller aperture. You don't get
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
> > how ARM GIC works for this mechanism.
> 
> I'm not seeing how SW_MSI will help here, but so far I've just done some
> quick grepping and code skimming.

SW_MSI is intended to deal with exactly what you are describing - the
IOMMU translates the MSI aperture.

1) The driver declares an IOMMU_RESV_SW_MSI in its reserved_regions
2) When a subsystem wishes to use an iommu_domain it must detect the
   SW_MSI and set up some IOVA for it to use
   a) dma-iommu.c allocates IOVA dynamically per-page via
      iommu_dma_get_msi_page()
   b) Classic VFIO uses iommu_get_msi_cookie() with the SW_MSI range.
   c) iommufd has its own logic in and around iommufd_sw_msi()
3) When the IRQ driver wants to establish an MSI it computes the
   untranslated address it wants and then calls iommu_dma_prepare_msi()
   (see the sketch after this list)

   iommu_dma_prepare_msi() will return a msi_desc with the address
   adjusted

   For instance dma-iommu.c has iommu_dma_prepare_msi() call into
   iommu_dma_get_msi_page() which will establish any required mapping
   for phys at some IOVA it picks.
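
As a rough sketch of step 3 from the irqchip side - imsic_target_addr()
is a made-up helper standing in for however the driver computes the
untranslated IMSIC address; the iommu_dma_*() calls are the real API:

  #include <linux/msi.h>

  /*
   * Illustrative only: in real drivers iommu_dma_prepare_msi() runs
   * once at IRQ allocation time (it may sleep) and the compose hook
   * only rewrites the address; folded together here to show the flow.
   */
  static void example_compose_msi_msg(struct irq_data *d,
                                      struct msi_msg *msg)
  {
          struct msi_desc *desc = irq_data_get_msi_desc(d);
          phys_addr_t pa = imsic_target_addr(d);   /* assumed helper */

          /* Map pa into the SW_MSI aperture (no-op w/o translation) */
          if (WARN_ON(iommu_dma_prepare_msi(desc, pa)))
                  return;

          msg->address_hi = upper_32_bits(pa);
          msg->address_lo = lower_32_bits(pa);
          msg->data = d->hwirq;                    /* illustrative payload */

          /* Rewrite address_lo/hi to the IOVA chosen in step 2 */
          iommu_dma_compose_msi_msg(desc, msg);
  }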

The riscv mess has the additional obnoxious complexity that the S1/S2
are inconsistent.

Right now the only way to deal with this would be to only use one of
the S1 or S2 and make that decision when the iommu driver starts. You
can return the right SW_MSI/HW_MSI based on which PAGING domain style
the driver is going to use.

I recommend that if the S2 is available you make the driver always use
the S2 for everything, and ignore the S1 except for explicit two-stage
translation setup by a hypervisor. Thus always return HW_MSI.

If the iommu does not support S2 then always use S1 and always return
SW_MSI.
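
In get_resv_regions() terms that could look something like this sketch
(use_s2, msi_base/msi_size and the dev_to_iommu() helper are all
assumptions about how the driver might record its probe-time decision):

  #include <linux/iommu.h>

  static void riscv_iommu_get_resv_regions(struct device *dev,
                                           struct list_head *head)
  {
          struct riscv_iommu_device *iommu = dev_to_iommu(dev); /* assumed */
          struct iommu_resv_region *region;

          /* Decided once at probe: S2-only exposes the fixed
           * DC.msi_addr_* window as HW_MSI, S1-only exposes a
           * software-chosen aperture as SW_MSI. */
          enum iommu_resv_type type = iommu->use_s2 ? IOMMU_RESV_HW_MSI
                                                    : IOMMU_RESV_SW_MSI;

          region = iommu_alloc_resv_region(iommu->msi_base, iommu->msi_size,
                                           IOMMU_WRITE | IOMMU_NOEXEC |
                                           IOMMU_MMIO, type, GFP_KERNEL);
          if (region)
                  list_add_tail(&region->list, head);
  }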

Disallow SW from creating a PAGING domain that mismatches the expected
interrupt setup.

There is still quite some mess to enhance the iommu_dma_prepare_msi()
path to support all the modes; that will need new code, as ARM doesn't
have the S1/S2 difference (sigh).

But broadly this is the sketch for where the touchpoints should be to
set up an MSI aperture that has to go through the page table.

> The first five patches plus the "enable IOMMU_DMA" will allow paging
> domains to be used by default, while paving the way for patches 6-8 to
> allow host IRQs to be isolated to the best of our ability (only able to
> access IMSICs to which they are affined). So we could have
> 
> series1: irqdomain + map all imsics + enable IOMMU_DMA

Yes, please let's try to get through this first..

> series2: actually apply irqdomain in order to implement map/unmap of MSI
>          ptes based on IRQ affinity - set IRQ_DOMAIN_FLAG_ISOLATED_MSI,
>          because that's the best we've got...

Makes sense, but as above, you need to do something to make it
actually implement IRQ_DOMAIN_FLAG_ISOLATED_MSI. Limiting to harts is
not the same thing.

This also needs some kind of new iommufd support, and I don't know how
it will work, but setting the DC.msi_addr_* should come from userspace
and create a new reserved region.. Maybe we need to create per-domain
reserved regions..

> series3: the rest of the patches of this series which introduce irqbypass
>          support for the virt use case

Yes

> Would that be better? Or do you see some need for some patch splits as
> well?

I did not try to look at the patches for this, but I think your three
themes make a lot of sense.

Please include in the cover letter some of the explanations from these
two RFC postings about how the HW works.

Jason
