linux-kernel - Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aEiXXYdhyeqcNiHX@nvidia.com>
Date: Tue, 10 Jun 2025 13:36:45 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Robin Murphy <robin.murphy@....com>, <joro@...tes.org>, <will@...nel.org>,
	<bhelgaas@...gle.com>, <iommu@...ts.linux.dev>,
	<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
	<patches@...ts.linux.dev>, <pjaroszynski@...dia.com>, <vsethi@...dia.com>
Subject: Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets

On Tue, Jun 10, 2025 at 01:30:45PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> > On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > > Hi all,
> > > 
> > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > > before initiating a Function Level Reset, and then ensure no invalidation
> > > requests being issued to a device when its ATS capability is disabled.
> > 
> > Not really - what it says is that software should not expect to receive
> > invalidate completions from a function which is in the process of being
> > reset or powered off, and if software doesn't want to be confused by that
> > then it should take care to wait for completion or timeout of all
> > outstanding requests, and avoid issuing new requests, before initiating such
> > a reset or power transition.
> 
> The commit message can be more precise, but I agree with the
> conclusion that the right direction for Linux is to disable and block
> ATS, instead of trying to ignore completion time out events, or trying
> to block page table mutations. Ie do what the implementation note
> says..
> 
> Maybe:
> 
> PCIe permits a device to ignore ATS invalidation TLPs while it is
> processing FLR. This creates a problem visible to the OS where ATS
> invalidation commands will time out. For instance a SVA domain will
> have no coordination with a FLR event and can racily issue ATC
> invalidations into a resetting device.
> 
> The OS should do something to mitigate this as we do not want
> production systems to be reporting critical ATS failures, especially
> in a hypervisor environment. Broadly the OS could arrange to ignore
> the timeout, block page table mutations to prevent invalidations, or
> disable and block ATS.
> 
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
> and block ATS, and we already have iommu driver support to implement
> something like this. Implement this approach in the iommu core.
> 
> Provide a callback from the PCI subsystem that will enclose the FLR
> and have the iommu core temporarily change all the domain attachments
> into BLOCKED. When attaching a BLOCKED domain IOMMU drivers should
> fence any incoming ATS queries, synchronously stop issuing new ATS
> invalidations, and synchronously wait for all ATS invalidations to
> complete. This will avoid any ATS invaliation time outs.
> 
> IOMMU drivers may also disable ATS in PCI config space, but it is not
> required to solve the completion timeout problem. The PCI FLR logic
> will put all the iommu owned config space bits back before completing.
> 
> During this period holding the group mutex will not allow new domains
> to be attached to prevent any new ATS invalidations.

Will pick this writing.

> > I guess I can see how messing with the domain attachment
> > underneath the rest of the group manages to prevent new invalidate requests
> > from group->domain being issued to the given function, but it's pretty
> > horrid - leaving the mutex blocked might be just about tolerable for an FLR
> > that's supposed to take no longer than 100ms, but what if we do want to do
> > this for suspend/resume as well?
> 
> I don't view this a problem for FLR, we can hold a mutex for a long
> time. It principally delays domain changes which are kind of nonsense
> to be doing concurrently with FLR in the first place..
> 
> However, for suspend, we probably want to leave a marker in the group

IIUIC, the thing for suspend/resume is that it would result in a
long hold of the mutex, which can be a problem?

> that the group is force-blocked and all domain attach/detach logic
> will only update the group tracking structures and not call into the
> iommu driver. When the resume happens the core will set the current
> group domain list to the iommu driver. No need for a long lived lock
> this way.

Yea, what we don't want is driver re-enabling ATS. So, bypassing
it at the core level should work. Then, iommu_dev_reset_prepare
and iommu_dev_reset_done will only mutex the flag.

Thanks
Nicolin