Message-ID: <BN9PR11MB5276908231BA164E4AF8806F8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Tue, 5 Dec 2023 01:32:26 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Jason Gunthorpe <jgg@...pe.ca>
CC: Baolu Lu <baolu.lu@...ux.intel.com>,
Joerg Roedel <joro@...tes.org>,
"Will Deacon" <will@...nel.org>,
Robin Murphy <robin.murphy@....com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Nicolin Chen <nicolinc@...dia.com>,
"Liu, Yi L" <yi.l.liu@...el.com>,
Jacob Pan <jacob.jun.pan@...ux.intel.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()
> From: Jason Gunthorpe <jgg@...pe.ca>
> Sent: Monday, December 4, 2023 9:25 PM
>
> On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > From: Baolu Lu <baolu.lu@...ux.intel.com>
> > > Sent: Monday, December 4, 2023 9:33 AM
> > >
> > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > >> Even if atomic replacement were to be implemented,
> > > >> it would be necessary to ensure that all translation requests,
> > > >> translated requests, page requests and responses for the old
> > > >> domain are drained before switching to the new domain.
> > > >
> > > > Again, no it isn't required.
> > > >
> > > > Requests simply have to continue to be acked; it doesn't matter if
> > > > they are acked against the wrong domain, because the device will
> > > > simply re-issue them.
> > >
> > > Ah! I'm starting to get your point now.
> > >
> > > Even if a page fault response is postponed until a new context is in
> > > place, which could be another address space or the hardware blocking
> > > state, the hardware just retries.
> >
> > If blocking, then the device shouldn't retry.
>
> It does retry.
>
> The device is waiting on a PRI, and it gets back a completion. It
> issues a new ATS translation request (this is the retry) and the new
> domain responds with a failure indication.
I'm not sure that is the standard behavior defined by the PCIe spec.
According to "10.4.2 Page Request Group Response Message", a function's
response to a Page Request failure is implementation specific. So a new
ATS translation request is optional, and the device will likely abort
the DMA instead if the PRI response already indicates a failure.
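
To make that concrete, a hedged sketch in C of what a device might do
with each response code. The enum mirrors the codes named in sec
10.4.2.1; the handler and helper names are made up, since the spec
leaves the failure reaction to the implementation:

struct dev_ctx;				/* opaque device state (made up) */
void reissue_ats_translation(struct dev_ctx *dev);
void abort_dma(struct dev_ctx *dev);

/* Response codes named in PCIe sec 10.4.2.1 (wire encodings omitted). */
enum prg_response_code {
	PRG_RESP_SUCCESS,		/* pages made present; an ATS retry will work */
	PRG_RESP_INVALID_REQUEST,	/* one or more pages not made present         */
	PRG_RESP_FAILURE,		/* severe error; PRI state is suspect         */
};

/*
 * What a device *may* do with a PRG response. Only the success path
 * implies a retry; on a failure code, re-issuing ATS, aborting the
 * DMA, or raising an internal error are all spec-conformant choices.
 */
void handle_prg_response(struct dev_ctx *dev, enum prg_response_code rc)
{
	switch (rc) {
	case PRG_RESP_SUCCESS:
		reissue_ats_translation(dev);	/* the retry described above */
		break;
	case PRG_RESP_INVALID_REQUEST:
	case PRG_RESP_FAILURE:
	default:
		abort_dma(dev);		/* equally legal: no new ATS at all */
		break;
	}
}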
>
> If the new domain had a present page, it would respond with a
> translation.
>
> If the new domain has a non-present page, then we get a new PRI.
>
> The point is that, from the device's perspective, it is always doing
> something correct.
>
> > btw if a stale request targets a virtual address which is outside of
> > the valid VMAs of the new address space, then a visible side effect
> > will be incurred in handle_mm_fault() on the new space. Is that
> > desired?
>
> The whole thing is racy; if someone is radically changing the
> underlying mappings while DMA is ongoing, then there is no way to
> synchronize 'before' and 'after' against a concurrent external
> device.
>
> So who cares?
>
> What we care about is that the ATC is coherent and never has stale
> data. The invalidation after changing the translation ensures this
> regardless of any outstanding un-acked PRI.
>
> > Or if a pending response carries an error code (Invalid Request) from
> > the old address space is received by the device when the new address
> > space is already activated, the hardware will report an error even
> > though there might be a valid mapping in the new space.
>
> Again, all racy. If a DMA is ongoing at the same instant things are
> changed, there is no definitive way to say if it resolved before or
> after.
>
> The only thing we care about is that DMAs that complete before see
> the before translation, and DMAs that start after see the after
> translation.
>
> DMAs that cross choose one at random.
Yes, that makes sense for replacement.
But here we are talking about a draining requirement when disabling a
PASID entry, which is certainly not part of a replacement.
>
> > I don't think atomic replace is the main usage for this draining
> > requirement. Instead I'm more interested in the basic, popular
> > usage: attach-detach-attach. I'm not convinced that no draining is
> > required between the iommu and the device to avoid interference
> > between activities from the old/new address space.
>
> Something like IDXD needs to halt DMAs on the PASID and flush all
> outstanding DMA to get to a state where the PASID is quiet from the
> device perspective. This is the only way to stop interference.
Why is that IDXD-specific behavior? I suppose all devices need to
quiesce their outstanding DMAs when tearing down the binding between
the PASID and the previous address space.

What you described above is the normal behavior. In that case I agree
that no draining is required on the iommu side, given that the device
should have quiesced all outstanding DMAs, including page requests.

But there are also terminal conditions, e.g. when a workqueue is reset
after a hang, where additional draining is required from the iommu
side to ensure that all outstanding page requests/responses are
properly handled.
The VT-d spec defines a draining process to cope with those terminal
conditions (see "7.9 Pending Page Request Handling on Terminal
Conditions"). The intel-iommu driver just implements it by default for
simplicity (one could consider providing an explicit API for drivers
to call, but I'm not sure that is necessary if such terminal
conditions apply to most devices). Anyway, this is not a fast path.
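
For the archives, a rough sketch of that drain flow. All helper names
are made up; this is not the actual intel-iommu code, only the
ordering the spec asks for:

struct device;				/* opaque, as elsewhere in the kernel */
struct page_req;			/* one page request queue entry */

void block_pasid_translation(struct device *dev, unsigned int pasid);
void qi_wait_with_page_request_drain(struct device *dev, unsigned int pasid);
struct page_req *next_pending_page_request(struct device *dev, unsigned int pasid);
void respond_invalid_request(struct device *dev, struct page_req *req);

/*
 * Made-up sketch of draining on a terminal condition (VT-d sec 7.9):
 *  1. Tear down the PASID entry so the hardware stops accepting new
 *     page requests for it.
 *  2. Fence with a drain-flavored invalidation wait so every request
 *     the IOMMU already accepted lands in the page request queue.
 *  3. Answer whatever is left for this PASID so the device ends up
 *     with a response for every outstanding page request group.
 */
void drain_pasid_page_requests(struct device *dev, unsigned int pasid)
{
	struct page_req *req;

	block_pasid_translation(dev, pasid);		/* step 1 */
	qi_wait_with_page_request_drain(dev, pasid);	/* step 2 */

	while ((req = next_pending_page_request(dev, pasid)))
		respond_invalid_request(dev, req);	/* step 3 */
}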
Another example might be the stop marker. A device using stop markers
doesn't need to wait for outstanding page requests. According to the
PCIe spec ("10.4.1.2 Managing PASID Usage on PRG Requests"), the
device simply marks its outstanding page requests as stale and sends a
stop marker message to the IOMMU. Page responses for those stale
requests are ignored. But presumably the iommu driver still needs to
drain those requests up to the stop marker message at unbind time, to
avoid them being incorrectly routed to a new address space in case the
PASID is immediately rebound to another process.
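
Roughly like this, again with made-up names (the stop marker test
matches how such a request is encoded: last-request set, neither read
nor write requested, no response expected):

struct dev_ctx;				/* made-up per-device state */
struct page_req {
	unsigned int pasid;
	unsigned int lpig:1;		/* last request in group */
	unsigned int rd_req:1;		/* read access requested */
	unsigned int wr_req:1;		/* write access requested */
};

struct page_req *dequeue_page_request(struct dev_ctx *dev, unsigned int pasid);
void discard_page_request(struct page_req *req);

/* Last request in its group with neither read nor write requested. */
static int is_stop_marker(struct page_req *req)
{
	return req->lpig && !req->rd_req && !req->wr_req;
}

/*
 * Made-up unbind-side drain: throw away stale requests for this PASID
 * until the stop marker shows up. The device ignores responses to
 * stale requests, so none are sent; the point is that nothing queued
 * before the marker can leak into a future binding of the same PASID.
 */
void drain_until_stop_marker(struct dev_ctx *dev, unsigned int pasid)
{
	struct page_req *req;

	while ((req = dequeue_page_request(dev, pasid))) {
		if (is_stop_marker(req))
			break;
		discard_page_request(req);
	}
}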
>
> If the device is still issuing DMA after the domain changes, then it
> is never going to work right.
>
> If *IDXD* needs some help to flush PRIs after it halts DMAs (because
> it can't do it on its own for some reason), then IDXD should have an
> explicit call to do that, after suspending new DMA.
As above, I don't think IDXD itself has any special requirements. We
are discussing general device terminal conditions, which are covered
by the VT-d spec.
>
> We don't know what things devices will need to do here; devices that
> are able to wait for PRIs to complete may want a cancelling flush to
> speed that up, and that shouldn't be part of the translation change.
>
> IOW the act of halting DMA and the act of changing the translation
> really should be different things. Then we get into interesting
> questions like what sequence is required for a successful FLR. :\
>
> Jason