lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250610163045.GI543171@nvidia.com>
Date: Tue, 10 Jun 2025 13:30:45 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Robin Murphy <robin.murphy@....com>
Cc: Nicolin Chen <nicolinc@...dia.com>, joro@...tes.org, will@...nel.org,
	bhelgaas@...gle.com, iommu@...ts.linux.dev,
	linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
	patches@...ts.linux.dev, pjaroszynski@...dia.com, vsethi@...dia.com
Subject: Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets

On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > Hi all,
> > 
> > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > before initiating a Function Level Reset, and then ensure no invalidation
> > requests being issued to a device when its ATS capability is disabled.
> 
> Not really - what it says is that software should not expect to receive
> invalidate completions from a function which is in the process of being
> reset or powered off, and if software doesn't want to be confused by that
> then it should take care to wait for completion or timeout of all
> outstanding requests, and avoid issuing new requests, before initiating such
> a reset or power transition.

The commit message can be more precise, but I agree with the
conclusion that the right direction for Linux is to disable and block
ATS, instead of trying to ignore completion time out events, or trying
to block page table mutations. Ie do what the implementation note
says..

Maybe:

PCIe permits a device to ignore ATS invalidation TLPs while it is
processing FLR. This creates a problem visible to the OS where ATS
invalidation commands will time out. For instance a SVA domain will
have no coordination with a FLR event and can racily issue ATC
invalidations into a resetting device.

The OS should do something to mitigate this as we do not want
production systems to be reporting critical ATS failures, especially
in a hypervisor environment. Broadly the OS could arrange to ignore
the timeout, block page table mutations to prevent invalidations, or
disable and block ATS.

The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
and block ATS, and we already have iommu driver support to implement
something like this. Implement this approach in the iommu core.

Provide a callback from the PCI subsystem that will enclose the FLR
and have the iommu core temporarily change all the domain attachments
into BLOCKED. When attaching a BLOCKED domain IOMMU drivers should
fence any incoming ATS queries, synchronously stop issuing new ATS
invalidations, and synchronously wait for all ATS invalidations to
complete. This will avoid any ATS invaliation time outs.

IOMMU drivers may also disable ATS in PCI config space, but it is not
required to solve the completion timeout problem. The PCI FLR logic
will put all the iommu owned config space bits back before completing.

During this period holding the group mutex will not allow new domains
to be attached to prevent any new ATS invalidations.

> > Therefore, there needs to be a sync between IOMMU and PCI subsystems, to
> > ensure that ATS will be disabled and never gets re-enabled until the FLR
> > finishes. Add a pair of new IOMMU helpers for PCI reset functions to call
> > before and after the reset routines. These two helpers will temporally
> > attach the device's RID/PASID to IOMMU_DOMAIN_BLOCKED, which should allow
> > its IOMMU driver to pause any DMA traffic and disable ATS feature until
> > the FLR is done.
> 
> FLR must inherently stop the function from issuing any kind of requests
> anyway (see 6.6.2), so "pausing" traffic at the IOMMU end seems like a
> non-issue. 

Yeah..  I haven't liked this blocking domain approach too much, but I
was convinced..

It has the strong advantage of being simple to implement and not
requiring any special cases or code in the driver. We don't actually
need the driver to disable ATS in PCIe config space, it only needs to
stop sending invalidations and fail all new ATS requests. All existing
drivers inherently do this already for blocked domains, so this should
solve the problem for everyone.

> I guess I can see how messing with the domain attachment
> underneath the rest of the group manages to prevent new invalidate requests
> from group->domain being issued to the given function, but it's pretty
> horrid - leaving the mutex blocked might be just about tolerable for an FLR
> that's supposed to take no longer than 100ms, but what if we do want to do
> this for suspend/resume as well?

I don't view this a problem for FLR, we can hold a mutex for a long
time. It principally delays domain changes which are kind of nonsense
to be doing concurrently with FLR in the first place..

However, for suspend, we probably want to leave a marker in the group
that the group is force-blocked and all domain attach/detach logic
will only update the group tracking structures and not call into the
iommu driver. When the resume happens the core will set the current
group domain list to the iommu driver. No need for a long lived lock
this way.

Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ