linux-kernel - Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d72d0a12-f3a4-4b4d-8b3b-5e59937a21d3@linux.intel.com>
Date: Wed, 17 Jan 2024 17:00:32 +0800
From: Ethan Zhao <haifeng.zhao@...ux.intel.com>
To: Baolu Lu <baolu.lu@...ux.intel.com>, kevin.tian@...el.com,
 bhelgaas@...gle.com, dwmw2@...radead.org, will@...nel.org,
 robin.murphy@....com, lukas@...ner.de
Cc: linux-pci@...r.kernel.org, iommu@...ts.linux.dev,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable
 device


On 1/17/2024 1:38 PM, Ethan Zhao wrote:
>
> On 1/17/2024 1:26 PM, Ethan Zhao wrote:
>>
>> On 1/17/2024 11:24 AM, Baolu Lu wrote:
>>> On 2024/1/15 15:58, Ethan Zhao wrote:
>>>> -static int qi_check_fault(struct intel_iommu *iommu, int index, 
>>>> int wait_index)
>>>> +static int qi_check_fault(struct intel_iommu *iommu, int index, 
>>>> int wait_index,
>>>> +                  pci_dev *target_pdev)
>>>>   {
>>>>          u32 fault;
>>>>          int head, tail;
>>>> +       u64 iqe_err, ice_sid;
>>>>          struct q_inval *qi = iommu->qi;
>>>>          int shift = qi_shift(iommu);
>>>>
>>>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                  return -EAGAIN;
>>>>
>>>> +       /*
>>>> +        * If the ATS invalidation target device is gone this 
>>>> moment (surprise
>>>> +        * removed, died, no response) don't try this request 
>>>> again. this
>>>> +        * request will not get valid result anymore. but the 
>>>> request was
>>>> +        * already submitted to hardware and we predict to get a 
>>>> ITE in
>>>> +        * followed batch of request, if so, it will get handled then.
>>>> +        */
>>>
>>> We can't leave the ITE triggered by this request for the next one, 
>>> which
>>> has no context about why this happened. Perhaps move below code down to
>>> the segment that handles ITEs.
>>
>> Here, the invalidation request has been issued to hardware but target 
>> device
>>
>> gone, we can't loop and wait for the ITE for this request to happen, 
>> and we
>>
>> bail out here because we hold lock_irqsave lock , the ITE still could 
>> happen
>>
>> with later batch request in the future,  though it is not triggered 
>> by that request,
>>
>> but it could still be cleaned/handled. move it to the fault() segment 
>> ?,there means
>
> That moment, the ITE was triggered by previous requests, they are not 
> in current
>
> context, also shouldn't be retried, they have response time over 
> expected.
>
> but triggered ITE fault blocks this patch request, we should retry 
> this batch request.
>
> we just clean the fault and retry it.  nothing missed.
>
>
> Thanks,
>
> Ethan
>
>>
>> ITE already happened, no need to check target presence anymore.
>>
>> did I miss something about the context lost ?
>>
>>>
>>> Another concern is about qi_dump_fault(), which pr_err's the fault
>>> message as long as the register is set. Some faults are predictable,
>>> such as cache invalidation for surprise-removed devices. 
>>> Unconditionally
>>> reporting errors with pr_err() may lead the user to believe that a more
>>> serious hardware error has occurred. Probably we can refine this 
>>> part of
>>> the code as well.
>>
>> Agree, may refine them in seperated series ?
>>
>> loop and always retry IQE, ICE don't make sense per my 
>> understanding.  if
>>
>> IQE happened retry it will always reproduce the fault, because 
>> request is the same.
>>
>> we could fix them together in other patches.
>>
>>
>> Thanks,
>>
>> Ethan
>>
>>>
>>> Others look sane to me.
>>>
>>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>>> +               return -EINVAL;
>>>> +
>>>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>>>                  qi_dump_fault(iommu, fault);
>>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
>>>> *iommu, int index, int wait_index)
>>>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>>
>>>> +               /*
>>>> +                * SID field is valid only when the ITE field is 
>>>> Set in FSTS_REG
>>>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>>>> +                */
>>>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>>> +
>>>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>>                  pr_info("Invalidation Time-out Error (ITE) 
>>>> cleared\n");
>>>>
>>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
>>>> *iommu, int index, int wait_index)
>>>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>                  } while (head != tail);
>>>>
>>>> +               /*
>>>> +                * If got ITE, we need to check if the sid of ITE 
>>>> is the same as
>>>> +                * current ATS invalidation target device, if yes, 
>>>> don't try this
>>>> +                * request anymore, the target device has a 
>>>> response time beyound
>>>> +                * expected. 0 value of ice_sid means old device, 
>>>> no ice_sid value.
>>>> +                */
>>>> +               if (target_pdev && ice_sid && ice_sid ==
>>>> +                   pci_dev_id(pci_physfn(target_pdev))
>>>> +                               return -ETIMEDOUT;
>>>> +
>>>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                          return -EAGAIN;
>>>>          }
>>>
>>> Best regards,
>>> baolu
>>
>
new proposed qi_check_fault() change as following:

- remove the lines of

   if (qi->desc_status[wait_index] == QI_ABORT)

      return -EAGAIN;

   reason see the comment in the code.

- others, the same as last proposed code.


--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,15 +1267,33 @@ static void qi_dump_fault(struct intel_iommu 
*iommu, u32 fault)
                (unsigned long long)desc->qw1);
  }

-static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index,
+                  pci_dev *target_pdev)
  {
         u32 fault;
         int head, tail;
+       u64 iqe_err, ice_sid;
         struct q_inval *qi = iommu->qi;
         int shift = qi_shift(iommu);
+       /*
+        * If the inv_wait_dsc got QI_ABORT here, means an ITE happens, 
retry
+        * this batch of invalidation request without clearing the ITE fault
+        * will be trapped into dead loop. remov this line.
+        * see Intel VT-d spec r4.1 sec 6.5.2.10, 6.5.2.8
+        *
+        * if (qi->desc_status[wait_index] == QI_ABORT)
+        *      return -EAGAIN;
+        */

-       if (qi->desc_status[wait_index] == QI_ABORT)
-               return -EAGAIN;
+       /*
+        * If the ATS invalidation target device is gone this moment 
(surprise
+        * removed, died, no response) don't try this request again. this
+        * request will not get valid result anymore. but the request was
+        * already submitted to hardware and we predict to get a ITE in
+        * followed batch of request, if so, it will get handled then.
+        */
+       if (target_pdev && !pci_device_is_present(target_pdev))
+               return -EINVAL;

         fault = readl(iommu->reg + DMAR_FSTS_REG);
         if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
@@ -1315,6 +1333,13 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                 tail = readl(iommu->reg + DMAR_IQT_REG);
                 tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

+               /*
+                * SID field is valid only when the ITE field is Set in 
FSTS_REG
+                * see Intel VT-d spec r4.1, section 11.4.9.9
+                */
+               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
                 writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
                 pr_info("Invalidation Time-out Error (ITE) cleared\n");

@@ -1324,6 +1349,16 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                 } while (head != tail);

+               /*
+                * If got ITE, we need to check if the sid of ITE is the 
same as
+                * current ATS invalidation target device, if yes, don't 
try this
+                * request anymore, the target device has a response 
time beyound
+                * expected. 0 value of ice_sid means old device, no 
ice_sid value.
+                */
+               if (target_pdev && ice_sid && ice_sid ==
+                   pci_dev_id(pci_physfn(target_pdev))
+                       return -ETIMEDOUT;
+
                 if (qi->desc_status[wait_index] == QI_ABORT)
                         return -EAGAIN;
         }
@@ -1344,7 +1379,7 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
   * can be part of the submission but it will not be polled for completion.
   */
  int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-                  unsigned int count, unsigned long options)
+                  unsigned int count, unsigned long options, pci_dev 
*target_pdev)
  {
         struct q_inval *qi = iommu->qi;
         s64 devtlb_start_ktime = 0;
@@ -1430,7 +1465,7 @@ int qi_submit_sync(struct intel_iommu *iommu, 
struct qi_desc *desc,
                  * a deadlock where the interrupt context can wait 
indefinitely
                  * for free slots in the queue.
                  */
-               rc = qi_check_fault(iommu, index, wait_index);
+               rc = qi_check_fault(iommu, index, wait_index, target_pdev);
                 if (rc)
                         break;

Thanks,

Ethan