linux-kernel - Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode

Message-Id: <20251222111935.489-1-guojinhui.liam@bytedance.com>
Date: Mon, 22 Dec 2025 19:19:35 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <kevin.tian@...el.com>
Cc: <baolu.lu@...ux.intel.com>, <dwmw2@...radead.org>, 
	<guojinhui.liam@...edance.com>, <iommu@...ts.linux.dev>, 
	<joro@...tes.org>, <linux-kernel@...r.kernel.org>, 
	<stable@...r.kernel.org>, <will@...nel.org>
Subject: Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode

On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> > From: Jinhui Guo <guojinhui.liam@...edance.com>
> > Sent: Thursday, December 11, 2025 12:00 PM
> > 
> > Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> > request when device is disconnected") relies on
> > pci_dev_is_disconnected() to skip ATS invalidation for
> > safely-removed devices, but it does not cover link-down caused
> > by faults, which can still hard-lock the system.
> 
> According to the commit msg it actually tries to fix the hard lockup
> with surprise removal. For safe removal the device is not removed
> before invalidation is done:
> 
> "
>     For safe removal, device wouldn't be removed until the whole software
>     handling process is done, it wouldn't trigger the hard lock up issue
>     caused by too long ATS Invalidation timeout wait.
> "
> 
> Can you help articulate the problem, especially the part about
> 'link-down caused by faults'? What are those faults? How are
> they different from the said surprise removal in the commit
> msg to not set pci_dev_is_disconnected()?
> 

Hi Kevin, sorry for the delayed reply.

Both orderly and surprise removal of a PCIe device on a hot-plug port
normally trigger an interrupt from the PCIe switch.

We have, however, observed cases where no interrupt is generated when the
device suddenly loses its link; the behaviour is identical to setting the
Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
unknown. In that case, presumably nothing marks the device as disconnected,
so pci_dev_is_disconnected() keeps returning false.
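
For reference, a minimal sketch of the register bit mentioned above. The
offset (10h from the PCI Express capability base) and the Link Disable bit
(bit 4) follow the PCIe specification and match PCI_EXP_LNKCTL and
PCI_EXP_LNKCTL_LD in Linux's pci_regs.h; the helper names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Link Control register, relative to the PCI Express capability base. */
#define LNKCTL_OFFSET 0x10u
/* Link Disable is bit 4 of the Link Control register. */
#define LNKCTL_LD     0x0010u

/* Hypothetical helper: return the register value with Link Disable set or cleared. */
static inline uint16_t lnkctl_set_link_disable(uint16_t lnkctl, bool disable)
{
	return disable ? (uint16_t)(lnkctl | LNKCTL_LD)
		       : (uint16_t)(lnkctl & ~LNKCTL_LD);
}

/* Hypothetical helper: test whether Link Disable is set in a register value. */
static inline bool lnkctl_link_disabled(uint16_t lnkctl)
{
	return (lnkctl & LNKCTL_LD) != 0;
}

/* Hypothetical helper: config-space address of Link Control for a given capability base. */
static inline unsigned int lnkctl_config_addr(unsigned int pcie_cap_base)
{
	return pcie_cap_base + LNKCTL_OFFSET;
}
```

Writing the resulting value to the downstream port's Link Control register
is one possible way to emulate the symptom; whether it reproduces the exact
failure seen here is an assumption.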

> > 
> > For example, if a VM fails to connect to the PCIe device,
> 
> 'failed' for what reason?
> 
> > "virsh destroy" is executed to release resources and isolate
> > the fault, but a hard-lockup occurs while releasing the group fd.
> > 
> > Call Trace:
> >  qi_submit_sync
> >  qi_flush_dev_iotlb
> >  intel_pasid_tear_down_entry
> >  device_block_translation
> >  blocking_domain_attach_dev
> >  __iommu_attach_device
> >  __iommu_device_set_domain
> >  __iommu_group_set_domain_internal
> >  iommu_detach_group
> >  vfio_iommu_type1_detach_group
> >  vfio_group_detach_container
> >  vfio_group_fops_release
> >  __fput
> > 
> > Although pci_device_is_present() is slower than
> > pci_dev_is_disconnected(), it still takes only ~70 µs on a
> > ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> > and width increase.
> > 
> > Besides, devtlb_invalidation_with_pasid() is called only in the
> > paths below, which are far less frequent than memory map/unmap.
> > 
> > 1. mm-struct release
> > 2. {attach,release}_dev
> > 3. set/remove PASID
> > 4. dirty-tracking setup
> > 
> 
> surprise removal can happen at any time, e.g. after the check of
> pci_device_is_present(). In the end we need the logic in
> qi_check_fault() to check the presence upon ITE timeout error
> received to break the infinite loop. So in your case even with
> that logic in place you still observe lockup (probably because the
> hardware ITE timeout is longer than the lockup detection on
> the CPU)?

Are you referring to the timeout added in patch
https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

Our lockup-detection timeout is the default 10 s.

We see ITE-timeout messages in the kernel log, yet the system still
hard-locks, probably because, as you mentioned, the hardware ITE timeout
is longer than the CPU’s lockup-detection window. I’ll reproduce the
case and follow up with a deeper analysis.

kernel: [ 2402.642685][  T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][    C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][    C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][    C7] rcu:        8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][    C7] rcu:        (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][    C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][    C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][    C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S          E       6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][    C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][    C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][    C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][    C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][    C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][    C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][    C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][    C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][    C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][    C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][    C8] FS:  0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][    C8] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][    C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][    C8] PKRU: 55555554
kernel: [ 2423.643611][    C8] Call Trace:
kernel: [ 2423.643613][    C8]  <TASK>
kernel: [ 2423.643616][    C8]  ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][    C8]  qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][    C8]  __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][    C8]  domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][    C8]  pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][    C8]  device_block_translation+0x122/0x180
kernel: [ 2423.643634][    C8]  blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][    C8]  __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][    C8]  __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][    C8]  __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][    C8]  iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][    C8]  vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][    C8]  ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][    C8]  ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][    C8]  vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][    C8]  vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][    C8]  __fput+0xe6/0x2b0
kernel: [ 2423.643682][    C8]  task_work_run+0x58/0x90
kernel: [ 2423.643688][    C8]  do_exit+0x29b/0xa80
kernel: [ 2423.643694][    C8]  do_group_exit+0x2c/0x80
kernel: [ 2423.643696][    C8]  get_signal+0x8f9/0x900
kernel: [ 2423.643700][    C8]  arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][    C8]  ? __schedule+0x582/0xe80
kernel: [ 2423.643708][    C8]  exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][    C8]  do_syscall_64+0x262/0x630
kernel: [ 2423.643717][    C8]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][    C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][    C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][    C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][    C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][    C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][    C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][    C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][    C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][    C8]  </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][    C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][    C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][    C8]  ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][    C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S          EL      6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][    C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][    C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][    C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][    C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][    C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][    C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][    C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][    C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][    C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][    C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][    C8] FS:  0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][    C8] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][    C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][    C8] PKRU: 55555554
kernel: [ 2448.327993][    C8] Call Trace:
kernel: [ 2448.327995][    C8]  <TASK>
kernel: [ 2448.327997][    C8]  ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][    C8]  qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][    C8]  __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][    C8]  domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][    C8]  pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][    C8]  device_block_translation+0x122/0x180
kernel: [ 2448.328012][    C8]  blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][    C8]  __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][    C8]  __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][    C8]  __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][    C8]  iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][    C8]  vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][    C8]  ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][    C8]  ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][    C8]  vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][    C8]  vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][    C8]  __fput+0xe6/0x2b0
kernel: [ 2448.328049][    C8]  task_work_run+0x58/0x90
kernel: [ 2448.328053][    C8]  do_exit+0x29b/0xa80
kernel: [ 2448.328057][    C8]  do_group_exit+0x2c/0x80
kernel: [ 2448.328060][    C8]  get_signal+0x8f9/0x900
kernel: [ 2448.328064][    C8]  arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][    C8]  ? __schedule+0x582/0xe80
kernel: [ 2448.328070][    C8]  exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][    C8]  do_syscall_64+0x262/0x630
kernel: [ 2448.328076][    C8]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][    C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][    C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][    C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][    C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][    C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][    C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][    C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][    C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][    C8]  </TASK>
kernel: [ 2450.245901][    C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]
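
As a rough illustration of the loop-break idea discussed above (check device
presence when an ITE is reported, instead of resubmitting forever), here is a
toy user-space model. It is NOT the kernel's qi_submit_sync(); every name in
it is hypothetical, and it counts ITE events only, so it says nothing about
how long one hardware ITE timeout takes relative to the lockup detector.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; all names are hypothetical, not kernel APIs. */
enum wait_result { WAIT_DONE, WAIT_DEVICE_GONE, WAIT_STILL_SPINNING };

struct model_dev {
	bool present;       /* what a presence probe would report */
	int ites_before_ok; /* ITEs seen before the wait completes; < 0 means never */
};

/*
 * Resubmit after every ITE. Without the presence check, a device that never
 * answers keeps the loop spinning; the model caps it at max_ites so that it
 * terminates and reports how far it got.
 */
static enum wait_result model_submit_sync(const struct model_dev *dev,
					  bool check_presence,
					  int max_ites, int *ites_seen)
{
	*ites_seen = 0;
	for (;;) {
		if (dev->ites_before_ok >= 0 && *ites_seen >= dev->ites_before_ok)
			return WAIT_DONE;

		(*ites_seen)++; /* hardware reported an ITE */

		if (check_presence && !dev->present)
			return WAIT_DEVICE_GONE; /* give up instead of retrying */

		if (*ites_seen >= max_ites)
			return WAIT_STILL_SPINNING;
	}
}
```

With the check enabled, a vanished device ends the wait at the first ITE;
with it disabled, the model spins until the artificial cap. A device that is
merely slow still completes normally in both modes.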

> 
> In any case this change cannot 100% fix the lockup. It just
> reduces the possibility which should be made clear.

I agree that this change only reduces the likelihood of a lockup rather
than eliminating it, but I think covering more corner cases is still
worthwhile.

Best Regards,
Jinhui