linux-kernel - Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2354dd69-0179-4689-bc35-f4bf4ea5a886@linux.intel.com>
Date:   Mon, 4 Dec 2023 09:32:37 +0800
From:   Baolu Lu <baolu.lu@...ux.intel.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     baolu.lu@...ux.intel.com, Joerg Roedel <joro@...tes.org>,
        Will Deacon <will@...nel.org>,
        Robin Murphy <robin.murphy@....com>,
        Kevin Tian <kevin.tian@...el.com>,
        Jean-Philippe Brucker <jean-philippe@...aro.org>,
        Nicolin Chen <nicolinc@...dia.com>,
        Yi Liu <yi.l.liu@...el.com>,
        Jacob Pan <jacob.jun.pan@...ux.intel.com>,
        Yan Zhao <yan.y.zhao@...el.com>, iommu@...ts.linux.dev,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
>> On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
>>> On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
>>>> The iopf_queue_flush_dev() is called by the iommu driver before releasing
>>>> a PASID. It ensures that all pending faults for this PASID have been
>>>> handled or cancelled, and won't hit the address space that reuses this
>>>> PASID. The driver must make sure that no new fault is added to the queue.
>>> This needs more explanation, why should anyone care?
>>>
>>> More importantly, why is*discarding*  the right thing to do?
>>> Especially why would we discard a partial page request group?
>>>
>>> After we change a translation we may have PRI requests in a
>>> queue. They need to be acknowledged, not discarded. The DMA in the
>>> device should be restarted and the device should observe the new
>>> translation - if it is blocking then it should take a DMA error.
>>>
>>> More broadly, we should just let things run their normal course. The
>>> domain to deliver the fault to should be determined very early. If we
>>> get a fault and there is no fault domain currently assigned then just
>>> restart it.
>>>
>>> The main reason to fence would be to allow the domain to become freed
>>> as the faults should be holding pointers to it. But I feel there are
>>> simpler options for that then this..
>>
>> In the iommu_detach_device_pasid() path, the domain is about to be
>> removed from the pasid of device. The IOMMU driver performs the
>> following steps sequentially:
> 
> I know that is why it does, but it doesn't explain at all why.
> 
>> 1. Clears the pasid translation entry. Thus, all subsequent DMA
>>     transactions (translation requests, translated requests or page
>>     requests) targeting the iommu domain will be blocked.
>>
>> 2. Waits until all pending page requests for the device's PASID have
>>     been reported to upper layers via the iommu_report_device_fault().
>>     However, this does not guarantee that all page requests have been
>>     responded.
>>
>> 3. Free all partial page requests for this pasid since the page request
>>     response is only needed for a complete request group. There's no
>>     action required for the page requests which are not last of a request
>>     group.
> 
> But we expect the last to come eventually since everything should be
> grouped properly, so why bother doing this?
> 
> Indeed if 2 worked, how is this even possible to have partials?

Step 1 clears the pasid table entry, hence all subsequent page requests
are blocked (hardware auto-respond the request but not put it in the
queue).

It is possible that a portion of a page fault group may have been queued
for processing, but the last request is being blocked by hardware due
to the pasid entry being in the blocking state.

In reality, this may be a no-op as I haven't seen any real-world
implementations of multiple-requests fault groups on Intel platforms.

>   
>> 5. Follow the IOMMU hardware requirements (for example, VT-d sepc,
>>     section 7.10, Software Steps to Drain Page Requests & Responses) to
>>     drain in-flight page requests and page group responses between the
>>     remapping hardware queues and the endpoint device.
>>
>> With above steps done in iommu_detach_device_pasid(), the pasid could be
>> re-used for any other address space.
> 
> As I said, that isn't even required. There is no issue with leaking
> PRI's across attachments.
> 
> 
>>> I suppose the driver is expected to stop calling
>>> iommu_report_device_fault() before calling this function, but that
>>> doesn't seem like it is going to be possible. Drivers should be
>>> implementing atomic replace for the PASID updates and in that case
>>> there is no momement when it can say the HW will stop generating PRI.
>>
>> Atomic domain replacement for a PASID is not currently implemented in
>> the core or driver.
> 
> It is, the driver should implement set_dev_pasid in such a way that
> repeated calls do replacements, ideally atomically. This is what ARM
> SMMUv3 does after my changes.
> 
>> Even if atomic replacement were to be implemented,
>> it would be necessary to ensure that all translation requests,
>> translated requests, page requests and responses for the old domain are
>> drained before switching to the new domain.
> 
> Again, no it isn't required.
> 
> Requests simply have to continue to be acked, it doesn't matter if
> they are acked against the wrong domain because the device will simply
> re-issue them..

Ah! I start to get your point now.

Even a page fault response is postponed to a new address space, which
possibly be another address space or hardware blocking state, the
hardware just retries.

As long as we flushes all caches (IOTLB and device TLB) during 
switching, the mappings of the old domain won't leak. So it's safe to 
keep page requests there.

Do I get you correctly?

Best regards,
baolu