Message-ID: <f8e35741-6b4c-4e14-a0a5-2c3176840242@linux.intel.com>
Date: Fri, 6 Feb 2026 10:55:12 +0800
From: Baolu Lu <baolu.lu@...ux.intel.com>
To: "guanghuifeng@...ux.alibaba.com" <guanghuifeng@...ux.alibaba.com>,
 dwmw2@...radead.org, joro@...tes.org, will@...nel.org, robin.murphy@....com,
 iommu@...ts.linux.dev, linux-kernel@...r.kernel.org
Cc: xunlei <xlpang@...ux.alibaba.com>
Subject: Re: [PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry

On 2/5/26 18:28, guanghuifeng@...ux.alibaba.com wrote:
> 
> On 2026/2/4 17:32, Baolu Lu wrote:
>> On 2/2/2026 10:09 AM, Guanghui Feng wrote:
>>> Device-TLB Invalidation Response Time-out (ITE) handling was added
>>> in commit 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695.
>>>
>>> When an ITE occurs, the IOMMU sets the ITE (Invalidation Time-out
>>> Error) field in the Fault Status Register. No new descriptors are
>>> fetched from the Invalidation Queue until software clears the ITE
>>> field in the Fault Status Register. Tail Pointer Register updates by
>>> software while the ITE field is set do not cause descriptor fetches
>>> by hardware. At the time the ITE field is set, hardware aborts any
>>> inv_wait_dsc commands pending in hardware and does not increment the
>>> Invalidation Queue Head Register. When software clears the ITE field
>>> in the Fault Status Register, hardware fetches the descriptor
>>> pointed to by the Invalidation Queue Head Register.
>>>
>>> But qi_check_fault still follows the assumptions of the 2009 commit
>>> 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695: only one struct qi_desc is
>>> submitted at a time, and each qi_desc request is immediately
>>> followed by a wait_desc (QI_IWD_TYPE) for synchronization. The IOMMU
>>> driver therefore assumes that invalidation queue entries at odd
>>> positions are wait_desc. After ITE is set, hardware aborts any
>>> inv_wait_dsc commands pending in hardware, so qi_check_fault
>>> iterates over the odd-position entries, treating them as wait_desc,
>>> and sets their desc_status to QI_ABORT. However, the current
>>> implementation allows multiple struct qi_desc to be submitted at
>>> once, followed by a single wait_desc, so it is no longer guaranteed
>>> that odd-position entries are wait_desc. When the number of
>>> submitted struct qi_desc is even, the wait_desc's desc_status is
>>> never set to QI_ABORT, qi_check_fault returns 0, and qi_submit_sync
>>> spins in an infinite loop, causing a hard lockup because interrupts
>>> are disabled and the PCIe device does not respond to Device-TLB
>>> Invalidation requests.
>>
>> Yes. This appears to be a real software bug.
>>
>>>
>>> Additionally, if the device remains online and an IOMMU ITE
>>> occurs, simply returning -EAGAIN is sufficient. When processing
>>> the -EAGAIN result, qi_submit_sync will automatically reclaim
>>> all submitted struct qi_desc and resubmit the requests.
>>>
>>> This modification:
>>> 1. Correctly triggers resubmission of the struct qi_desc entries
>>> when an ITE occurs.
>>> 2. Prevents the IOMMU driver from spinning in an infinite loop in
>>> qi_submit_sync with interrupts disabled when an ITE occurs, avoiding
>>> a hard lockup.
>>
>> But I think this fix changes the behavior of the driver.
>>
>> Previously, when an ITE error was detected, it cleared the ITE so that
>> hardware could keep going, aborted all wait-descriptors that were being
>> handled by hardware, and returned -EAGAIN if its own wait-descriptor was
>> impacted.
>>
>> This patch changes the behavior; it returns -EAGAIN directly whenever it
>> detects an ITE error, regardless of whether its wait-desc is impacted.
>> In the single-threaded case, it works as expected, but a race
>> condition might occur when qi_submit_sync() is called from multiple
>> threads at the same time.
>>
>>>
>>> Signed-off-by: Guanghui Feng <guanghuifeng@...ux.alibaba.com>
>>> ---
>>>   drivers/iommu/intel/dmar.c | 18 +++---------------
>>>   1 file changed, 3 insertions(+), 15 deletions(-)
>>
>> Have you tried to fix it by dropping the "odd position" assumption? For
>> example, removing "head |= 1" and decrementing by 1 instead of 2 in the
>> loop?
>>
>>      do {
>>              if (qi->desc_status[head] == QI_IN_USE)
>>                      qi->desc_status[head] = QI_ABORT;
>>              head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>      } while (head != tail);
>>
>> Thanks,
>> baolu
> 
> Thank you for your reply.
> 
> There are a few points that need clarification:
> The descriptors between head and tail are requests that have not been 
> fetched and executed.
> 
> 
> Regarding the requests before the head:
> Method 1: Does the IOMMU update the head register immediately after
> fetching the descriptor?
> Method 2: Or does the IOMMU update the head register only after
> fetching and executing the request?
>
> The current Intel VT-d specification does not describe this behavior
> in detail. Does the IOMMU currently use Method 1?
>
> Therefore, after an ITE timeout, it is necessary to also resend the
> requests before the head index.

An obvious race that I can think of is something like this:

Thread A placed a dev-tlb-inv-desc in the invalidation queue. After
that, thread B placed an iotlb-inv-desc in the queue. Now the requests
in the queue look like this:

     dev-tlb-inv-desc for A
     iotlb-inv-desc for B

Then a device TLB invalidation timeout error happens and triggers the
ITE bit to be set in the fault register. Thread B sees this in its
qi_check_fault(), clears the ITE bit, and returns -EAGAIN. Then thread A
will loop infinitely waiting for DONE in its wait-desc.
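
To make the race concrete, the waiting side of qi_submit_sync() is
roughly shaped like this (heavily simplified, not a verbatim quote of
drivers/iommu/intel/dmar.c):

restart:
	rc = 0;
	while (qi->desc_status[wait_index] != QI_DONE) {
		/*
		 * Interrupts stay disabled across this wait.
		 * qi_check_fault() returns non-zero (e.g. -EAGAIN) only
		 * when it sees a fault it must report to this caller, such
		 * as this thread's own wait descriptor being marked
		 * QI_ABORT.  If thread B has already cleared the ITE bit
		 * and nobody set thread A's status to QI_ABORT, neither
		 * exit condition can ever become true.
		 */
		rc = qi_check_fault(iommu, index, wait_index);
		if (rc)
			break;
		cpu_relax();
	}

	if (rc == -EAGAIN)
		goto restart;	/* reclaim and resubmit the descriptors */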

The qi_submit_sync() logic has been there for years. Changing its
behavior without enough validation on real hardware will cause
unexpected issues. The better approach I would suggest is to fix only
the outdated "odd position" logic in qi_check_fault().
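
To be concrete, and only as an untested sketch of the direction rather
than a final patch, the ITE branch in qi_check_fault() could keep its
current structure and just drop the odd-slot stepping:

	if (fault & DMA_FSTS_ITE) {
		head = readl(iommu->reg + DMAR_IQH_REG);
		head = ((head >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
		/* no "head |= 1": wait descriptors are no longer
		 * guaranteed to sit in odd slots */
		tail = readl(iommu->reg + DMAR_IQT_REG);
		tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

		writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);

		do {
			if (qi->desc_status[head] == QI_IN_USE)
				qi->desc_status[head] = QI_ABORT;
			/* step by 1 instead of 2 so every aborted slot,
			 * including each batch's wait descriptor, is
			 * visited */
			head = (head - 1 + QI_LENGTH) % QI_LENGTH;
		} while (head != tail);

		if (qi->desc_status[wait_index] == QI_ABORT)
			return -EAGAIN;
	}

That would keep the original intent: only a thread whose own wait
descriptor was actually aborted returns -EAGAIN and resubmits, while
descriptors that have not been fetched yet complete normally once the
ITE field is cleared.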

Thanks,
baolu
