Message-ID: <BN9PR11MB5276A8C1C3F3989F46A7D7418C65A@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Mon, 9 Feb 2026 05:52:09 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "guanghuifeng@...ux.alibaba.com" <guanghuifeng@...ux.alibaba.com>, "Baolu
 Lu" <baolu.lu@...ux.intel.com>, "dwmw2@...radead.org" <dwmw2@...radead.org>,
	"joro@...tes.org" <joro@...tes.org>, "will@...nel.org" <will@...nel.org>,
	"robin.murphy@....com" <robin.murphy@....com>, "iommu@...ts.linux.dev"
	<iommu@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
CC: xunlei <xlpang@...ux.alibaba.com>
Subject: RE: [PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry

> From: guanghuifeng@...ux.alibaba.com <guanghuifeng@...ux.alibaba.com>
> Sent: Sunday, February 8, 2026 6:23 PM
> 
> On 2026/2/6 10:55, Baolu Lu wrote:
> >
> > An obvious race that I can think of is something like this:
> >
> > Thread A placed a dev-tlb-inv-desc in the invalidation queue. After
> > that, thread B placed an iotlb-inv-desc in the queue. Now the requests
> > in the queue look like this:
> >
> >     dev-tlb-inv-desc for A
> >     iotlb-inv-desc for B
> >
> > Then a device TLB invalidation timeout error happens and triggers the
> > ITE bit to be set in the fault register. Thread B sees this in its
> > qi_check_fault(), clears the ITE bit, and returns -EAGAIN. Then thread A
> > will loop infinitely waiting for DONE in its wait-desc.

It will either loop infinitely, or exit the loop only by happening to
capture the next ITE from newer submissions...
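
To make the hang concrete, here is a much simplified model of the
per-submitter poll loop (illustrative only, not the actual driver code;
qi_check_fault_model() and wait_for_wait_desc() are made-up names, and
registers, locking and timeouts are omitted):

#include <errno.h>	/* EAGAIN */

enum qi_status { QI_FREE, QI_IN_USE, QI_DONE, QI_ABORT };

#define QI_LENGTH	256

static enum qi_status desc_status[QI_LENGTH];

/* stand-in for qi_check_fault(): -EAGAIN only for a caller whose own
 * wait descriptor was marked aborted (or, in the real code, the one
 * caller that happens to see the ITE fault) */
static int qi_check_fault_model(int index, int wait_index)
{
	(void)index;
	return desc_status[wait_index] == QI_ABORT ? -EAGAIN : 0;
}

/* each submitter polls only its own wait descriptor slot */
static int wait_for_wait_desc(int index, int wait_index)
{
	while (desc_status[wait_index] != QI_DONE) {
		int rc = qi_check_fault_model(index, wait_index);

		if (rc)
			return rc;	/* caller resubmits on -EAGAIN */
		/* real code: cpu_relax() plus a timeout check */
	}
	return 0;
}

If another CPU clears ITE and returns -EAGAIN only for itself, the status
polled above stays QI_IN_USE and this loop never exits.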

> >
> > The qi_submit_sync() logic has been there for years. Changing its
> > behavior without enough validation on real hardware will cause
> > unexpected issues. The better approach I would suggest is to fix the
> > outdated logic.
> >
> > Thanks,
> > baolu
> 
> Thank you for your reply.
> 
> From the Intel VT-d documentation, it is known that the IOTLB
> maintenance process has sequential dependencies; for example, a context
> IOTLB flush must precede a PASID IOTLB flush. To ensure the execution
> order:
> 
> Method 1: Requests in the invalidation queue are executed sequentially,
> requiring only that the software submission order is valid. The IOMMU
> updates the head pointer only after completing a request.
> 
> Method 2: Requests in the invalidation queue are executed in parallel
> and out of order. The software ensures the execution order of multiple
> requests by adding wait_desc entries.

This is what the spec describes. Or, more accurately, it talks about the
hardware *fetching* one or more descriptors together, likely implying the
possibility of parallel execution.
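
For illustration, Method 2 in driver-ish terms looks roughly like the
sketch below. This is not the actual qi_submit_sync() code; queue_desc()
and queue_wait_desc() are hypothetical helpers standing in for the
invalidation-queue manipulation, and locking/timeouts are omitted:

#include <stdint.h>

enum { STATUS_PENDING = 0, STATUS_DONE = 1 };

/* hypothetical helpers: append one descriptor / one wait descriptor that
 * writes STATUS_DONE to *status once everything before it has completed */
void queue_desc(uint64_t qw0, uint64_t qw1);
void queue_wait_desc(volatile uint32_t *status);

/* enforce "context IOTLB flush before PASID IOTLB flush" in software */
void flush_in_order(uint64_t ctx_qw0, uint64_t ctx_qw1,
		    uint64_t pasid_qw0, uint64_t pasid_qw1)
{
	volatile uint32_t status = STATUS_PENDING;

	queue_desc(ctx_qw0, ctx_qw1);
	queue_wait_desc(&status);
	while (status != STATUS_DONE)
		;	/* real code polls with cpu_relax() and a timeout */

	/* only now submit the dependent flush (followed by its own wait
	 * descriptor in the same way) */
	queue_desc(pasid_qw0, pasid_qw1);
}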

> 
> If the Intel IOMMU uses Method 1, and only updates the head pointer upon
> completion of a request (even if a timeout occurs during execution after
> the request has been fetched), then after an ITE timeout, clearing the
> ITE status in any context will trigger the IOMMU to re-fetch and execute
> the request, without requiring resubmission of the descriptor.
> 
> Therefore, could you please help confirm the current behavior of the
> Intel IOMMU execution process?
> 

The spec is already clear about that process, and the current sw flow
(though with a bug, as Baolu pointed out) matches it.

Multiple CPUs may be submitting descriptors to the invalidation queue in
parallel, each then sitting in the poll loop (polling its wait status to
become QI_DONE or QI_ABORT), but there is only one CPU receiving the ITE
timeout interrupt.

The current logic is to have the CPU serving the interrupt update all
wait descriptors before the head pointer to QI_ABORT, effectively forcing
all waiting CPUs to abort and resubmit their descriptors.

That is why there is a check at the start of qi_check_fault():

        if (qi->desc_status[wait_index] == QI_ABORT)
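                /* set to QI_ABORT by the CPU that handled the ITE; resubmit */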
                return -EAGAIN;

It's broken now that the driver allows submitting multiple descriptors
in a batch, which breaks the assumption that the wait descriptor always
sits in an odd slot.
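
For reference, the QI_ABORT propagation is roughly the following walk (a
paraphrased, simplified model, not the verbatim driver code; the "| 1"
snap and the step of 2 are where the odd-slot assumption lives):

enum qi_status { QI_FREE, QI_IN_USE, QI_DONE, QI_ABORT };

#define QI_LENGTH	256

/* mark the wait descriptor of every pending submission QI_ABORT so its
 * owner bails out of the poll loop and resubmits; head/tail are slot
 * indexes derived from the queue head/tail registers, and tail is
 * assumed to be a wait-descriptor (odd) slot as well */
static void abort_pending_waiters(enum qi_status *desc_status,
				  int head, int tail)
{
	head |= 1;	/* snap to the slot assumed to hold a wait descriptor */
	do {
		if (desc_status[head] == QI_IN_USE)
			desc_status[head] = QI_ABORT;
		head = (head - 2 + QI_LENGTH) % QI_LENGTH;	/* odd slots only */
	} while (head != tail);
}

With batched submissions a wait descriptor can land in an even slot, so a
walk like this can skip it (and touch a non-wait slot instead), leaving
that submitter spinning on QI_IN_USE.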

But your change skips that step by simply returning -EAGAIN on that one
CPU, with all other CPUs still stuck in the wait loop because their wait
descriptor status is still QI_IN_USE.

This patch may work for you in a scenario where new descriptors are
continuously queued and a new ITE timeout happens to be triggered while
those other CPUs are polling.

But it's not reliable and not a correct fix. You should still update all
wait descriptors' status to QI_ABORT, along the lines sketched below.
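
Purely as a rough sketch of that direction (untested and simplified, not
a drop-in patch; the type check and constants here only approximate the
driver's), the walk could cover every pending slot between head and tail
and abort the ones that actually hold a wait descriptor, instead of
relying on slot parity, so batched submissions are handled too:

#include <stdint.h>

enum qi_status { QI_FREE, QI_IN_USE, QI_DONE, QI_ABORT };

#define QI_LENGTH	256
#define QI_IWD_TYPE	0x5	/* invalidation wait descriptor type */

struct qi_desc_model {
	uint64_t qw0;
	uint64_t qw1;
};

/* mark every pending wait descriptor QI_ABORT, regardless of its slot */
static void abort_all_waiters(const struct qi_desc_model *desc,
			      enum qi_status *desc_status,
			      int head, int tail)
{
	int i;

	for (i = head; i != tail; i = (i + 1) % QI_LENGTH) {
		if ((desc[i].qw0 & 0xf) == QI_IWD_TYPE &&
		    desc_status[i] == QI_IN_USE)
			desc_status[i] = QI_ABORT;	/* owner returns -EAGAIN */
	}
}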

Btw, the ITE timeout is presumably defined to be large enough to cover
the maximum possible PCIe ATS invalidation completion time on the device
side, but it looks like in your case the device is still responding (just
that, for some reason, the response is too slow to fit the IOMMU's
assumption). Is that why the retry could succeed?
