Message-ID: <52d3fe0b86bdc04fdbf3aae095b2f71f4ea12d44.camel@linux.ibm.com>
Date:   Thu, 01 Sep 2022 11:37:58 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Pierre Morel <pmorel@...ux.ibm.com>,
        Matthew Rosato <mjrosato@...ux.ibm.com>, iommu@...ts.linux.dev
Cc:     linux-s390@...r.kernel.org, borntraeger@...ux.ibm.com,
        hca@...ux.ibm.com, gor@...ux.ibm.com,
        gerald.schaefer@...ux.ibm.com, agordeev@...ux.ibm.com,
        svens@...ux.ibm.com, joro@...tes.org, will@...nel.org,
        robin.murphy@....com, jgg@...dia.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 1/2] iommu/s390: Fix race with release_device ops

On Thu, 2022-09-01 at 09:56 +0200, Pierre Morel wrote:
> 
> On 8/31/22 22:12, Matthew Rosato wrote:
> > With commit fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev
> > calls") s390-iommu is supposed to handle dynamic switching between IOMMU
> > domains and the DMA API handling.  However, this commit does not
> > sufficiently handle the case where the device is released via a call
> > to the release_device op as it may occur at the same time as an opposing
> > attach_dev or detach_dev since the group mutex is not held over
> > release_device.  This was observed when the device is deconfigured within
> > a small window during vfio-pci initialization and can result in WARNs and
> > potential kernel panics.
> > 
> > Handle this by tracking when the device is probed/released via
> > dev_iommu_priv_set/get().  Ensure that once the device is released only
> > release_device handles the re-init of the device DMA.
> > 
> > Fixes: fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev calls")
> > Signed-off-by: Matthew Rosato <mjrosato@...ux.ibm.com>
> > ---
> >   arch/s390/include/asm/pci.h |  1 +
> >   arch/s390/pci/pci.c         |  1 +
> >   drivers/iommu/s390-iommu.c  | 39 ++++++++++++++++++++++++++++++++++---
> >   3 files changed, 38 insertions(+), 3 deletions(-)
> > 
> > 
---8<---
> >   
> > @@ -206,10 +221,28 @@ static void s390_iommu_release_device(struct device *dev)
> > 
> 
---8<---
> > +		/* Make sure this device is removed from the domain list */
> >   		domain = iommu_get_domain_for_dev(dev);
> >   		if (domain)
> >   			s390_iommu_detach_device(domain, dev);
> > +		/* Now ensure DMA is initialized from here */
> > +		mutex_lock(&zdev->dma_domain_lock);
> > +		if (zdev->s390_domain) {
> > +			zdev->s390_domain = NULL;
> > +			zpci_unregister_ioat(zdev, 0);
> > +			zpci_dma_init_device(zdev);
> 
> Sorry if it is a stupid question, but two things look strange to me:
> 
> - having DMA initialized right after unregistering the IOAT:
> Is unregistering the IOAT really all we need to do before calling dma_init_device?
> 
> - having DMA initialized inside the release_device callback:
> Why isn't it done in device_probe?

As I understand it, iommu_release_device(), which calls this code, is only
used when a device goes away. So I think you're right that it makes little
sense to re-initialize DMA at this point; it's going to be torn down
immediately afterwards anyway. I do wonder if it would be an acceptably
small change to just set zdev->s390_domain = NULL here and leave DMA
uninitialized, making zpci_dma_exit_device() deal with that, e.g. by doing
nothing if zdev->dma_table is NULL, but I'm not sure.
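
Spelled out, that alternative would look roughly like the sketch below. This
is untested and only meant to illustrate the idea; in particular, whether
zpci_unregister_ioat() still belongs in the release path and whether the
early return is really all zpci_dma_exit_device() needs are assumptions on
my part:

        /* In s390_iommu_release_device(): drop the domain reference but,
         * unlike the patch above, don't re-initialize DMA since the
         * device is going away anyway. */
        mutex_lock(&zdev->dma_domain_lock);
        if (zdev->s390_domain) {
                zdev->s390_domain = NULL;
                zpci_unregister_ioat(zdev, 0);
        }
        mutex_unlock(&zdev->dma_domain_lock);

        /* In zpci_dma_exit_device(): tolerate the uninitialized DMA state
         * left behind by the above. */
        void zpci_dma_exit_device(struct zpci_dev *zdev)
        {
                if (!zdev->dma_table)
                        return;  /* DMA was never (re-)initialized */

                /* existing teardown continues unchanged from here */
        }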

Either way, I fear this mess is really just a symptom of our current design
oddity: we drive the same IOMMU hardware through both our DMA API
implementation (arch/s390/pci/pci_dma.c) and the IOMMU driver
(drivers/iommu/s390-iommu.c) and try to hand off between them smoothly,
whereas common code simply layers one on top of the other whenever an
IOMMU is used at all.

I think the correct medium-term solution is to use the common DMA API
implementation (drivers/iommu/dma-iommu.c) like everyone else, but that
isn't the minimal fix we need right now.

I do have a working prototype that uses the common implementation, but the
big problem I'm still searching for a solution to is its performance with a
virtualized IOMMU, where IOTLB flushes (RPCIT on s390) are used for
shadowing and are expensive and serialized. The optimization we have used
so far for unmap, doing only one global IOTLB flush once we run out of IOVA
space, is simply too much better in that scenario to ignore. As one data
point, on an NVMe device I get about _twice_ the IOPS with our existing
scheme compared to strict mode. That makes sense, as IOTLB flushes are
known to be the bottleneck and optimizing unmap like that cuts them almost
in half. Queued flushing is still much worse, likely due to serialization
of the shadowing, though again it works great on LPAR. To make sure this
isn't due to some bug in the IOMMU driver, I even tried converting our
existing DMA driver to layer on top of the IOMMU driver, with the same
result.
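
For reference, that existing scheme works roughly as in the simplified
sketch below. All helper names and the IOVA_FULL constant here are
illustrative placeholders rather than the real arch/s390/pci/pci_dma.c
code; the point is only that unmap merely invalidates translation-table
entries, while a single global RPCIT is issued when the IOVA allocator
wraps around:

        /* Lazy flushing, simplified: no IOTLB flush per unmap. */
        static void lazy_unmap(struct zpci_dev *zdev, unsigned long iova)
        {
                /* Invalidate the translation-table entry only; strict mode
                 * would additionally issue an RPCIT here. */
                invalidate_dma_table_entry(zdev, iova);  /* placeholder */
        }

        static unsigned long lazy_alloc_iova(struct zpci_dev *zdev, size_t size)
        {
                unsigned long iova = iova_alloc_next_fit(zdev, size);  /* placeholder */

                if (iova == IOVA_FULL) {
                        /* Out of IOVA space: one global RPCIT covers all
                         * unmaps accumulated since the last wrap-around. */
                        zpci_refresh_global(zdev);       /* placeholder */
                        iova_reset(zdev);                /* placeholder */
                        iova = iova_alloc_next_fit(zdev, size);
                }
                return iova;
        }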

