Message-Id: <20251223145830.644-1-guojinhui.liam@bytedance.com>
Date: Tue, 23 Dec 2025 22:58:30 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <baolu.lu@...ux.intel.com>
Cc: <bhelgaas@...gle.com>, <dwmw2@...radead.org>, 
	<guojinhui.liam@...edance.com>, <iommu@...ts.linux.dev>, 
	<joro@...tes.org>, <kevin.tian@...el.com>, 
	<linux-kernel@...r.kernel.org>, <stable@...r.kernel.org>, 
	<will@...nel.org>
Subject: Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode

On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
> > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> >>> From: Jinhui Guo<guojinhui.liam@...edance.com>
> >>> Sent: Thursday, December 11, 2025 12:00 PM
> >>>
> >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> >>> request when device is disconnected") relies on
> >>> pci_dev_is_disconnected() to skip ATS invalidation for
> >>> safely-removed devices, but it does not cover link-down caused
> >>> by faults, which can still hard-lock the system.
> >> According to the commit msg it actually tries to fix the hard lockup
> >> with surprise removal. For safe removal the device is not removed
> >> before invalidation is done:
> >>
> >> "
> >>      For safe removal, device wouldn't be removed until the whole software
> >>      handling process is done, it wouldn't trigger the hard lock up issue
> >>      caused by too long ATS Invalidation timeout wait.
> >> "
> >>
> >> Can you help articulate the problem especially about the part
> >> 'link-down caused by faults"? What are those faults? How are
> >> they different from the said surprise removal in the commit
> >> msg to not set pci_dev_is_disconnected()?
> >>
> > Hi, kevin, sorry for the delayed reply.
> > 
> > A normal or surprise removal of a PCIe device on a hot-plug port normally
> > triggers an interrupt from the PCIe switch.
> > 
> > We have, however, observed cases where no interrupt is generated when the
> > device suddenly loses its link; the behaviour is identical to setting the
> > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
> > what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> > unknown.
> 
> In this scenario, the hardware has effectively vanished, yet the device
> driver remains bound and the IOMMU resources haven't been released. I’m
> just curious if this stale state could trigger issues in other places
> before the kernel fully realizes the device is gone? I’m not objecting
> to the fix. I'm just interested in whether this 'zombie' state creates
> risks elsewhere.

Hi Baolu,

In our scenario we see no other issues: a hard-lockup panic is triggered the
moment the Mellanox Ethernet device vanishes. We can still analyze what happens
when the Mellanox Ethernet device is accessed while its link is disabled.
(If we check that the PCIe endpoint device (Mellanox Ethernet) is present
before issuing a device-IOTLB invalidation to the Intel IOMMU, no other issues
appear.)
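
A minimal userspace sketch of that "check presence, then flush" ordering is
shown below. It is a model only, not the code from the patch: struct fake_dev,
fake_cfg_read16() and flush_dev_iotlb_if_present() are hypothetical names, and
the presence test simply assumes that a Vendor ID read of 0xFFFF means the
endpoint is unreachable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an endpoint; not a kernel structure. */
struct fake_dev {
    bool link_up;
    uint16_t vendor_id;
};

/* Hypothetical config-space accessor: all-ones when the link is down. */
typedef uint16_t (*cfg_read16_fn)(void *dev, int where);

static uint16_t fake_cfg_read16(void *dev, int where)
{
    struct fake_dev *d = dev;

    (void)where; /* only the Vendor ID (offset 0x00) is modelled */
    return d->link_up ? d->vendor_id : 0xFFFF;
}

/* Assumption: Vendor ID 0xFFFF is treated as "device not present". */
static bool dev_is_present(void *dev, cfg_read16_fn rd)
{
    return rd(dev, 0x00) != 0xFFFF;
}

/* Returns true if a device-IOTLB invalidation would have been issued. */
static bool flush_dev_iotlb_if_present(void *dev, cfg_read16_fn rd)
{
    if (!dev_is_present(dev, rd))
        return false; /* skip: nothing would ever answer the request */

    /* Real code would queue the device-IOTLB invalidation here. */
    return true;
}
```

The check is inherently racy (the link can drop right after it passes), so it
narrows the window rather than closing it.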

According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two kinds of
TLPs: posted and non-posted. Non-posted TLPs require a completion TLP; posted
TLPs do not.

- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request, or a
  Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request, an
  I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.
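
The classification above can be written down as a small sketch. The enum and
helper names are made up for illustration; only the grouping itself comes from
the list.

```c
#include <stdbool.h>

/* Illustrative request kinds; the names are hypothetical. */
enum tlp_kind {
    TLP_MEM_WRITE,
    TLP_MESSAGE,
    TLP_CFG_READ,
    TLP_IO_READ,
    TLP_MEM_READ,
    TLP_CFG_WRITE,
    TLP_IO_WRITE,
    TLP_ATOMIC_OP,
};

/* Posted Request: Memory Write or Message. */
static bool tlp_is_posted(enum tlp_kind k)
{
    return k == TLP_MEM_WRITE || k == TLP_MESSAGE;
}

/* Read Request: Configuration, I/O or Memory Read. */
static bool tlp_is_read(enum tlp_kind k)
{
    return k == TLP_CFG_READ || k == TLP_IO_READ || k == TLP_MEM_READ;
}

/* NPR with Data: Configuration Write, I/O Write or AtomicOp. */
static bool tlp_is_npr_with_data(enum tlp_kind k)
{
    return k == TLP_CFG_WRITE || k == TLP_IO_WRITE || k == TLP_ATOMIC_OP;
}

/* Non-Posted Request: Read Request or NPR with Data; needs a Completion. */
static bool tlp_needs_completion(enum tlp_kind k)
{
    return tlp_is_read(k) || tlp_is_npr_with_data(k);
}
```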

When the CPU issues a PCIe memory-write TLP (posted) via a MOV instruction,
the instruction retires immediately after the packet reaches the Root Complex;
no Data-Link ACK/NAK is required. A memory-read TLP (non-posted), however, stalls
the core until the corresponding Completion TLP is received. If that Completion
never arrives, which is the case when the link is lost but the LTSSM does not
enter the Disabled state, the CPU hangs.

However, if the LTSSM enters the Disabled state, the Root Port returns
Completer-Abort (CA) for any non-posted TLP, so a read completes with all-ones
data (0xFFFFFFFF) instead of stalling.

I ran some tests on the machine after setting the Link Disable bit in the switch’s
Link Control register (offset 10h):
 # setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010
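
For reference, the value written by that setpci command is just bit 4 (Link
Disable) of the 16-bit Link Control register. The sketch below shows the bit
arithmetic only; the macro and function names are illustrative, and the
read-modify-write variant is my assumption about how one would keep the other
Link Control bits intact, not what was done in the test above (which writes the
whole word).

```c
#include <stdint.h>

#define LNKCTL_OFFSET        0x10   /* Link Control, relative to the PCIe capability */
#define LNKCTL_LINK_DISABLE  0x0010 /* bit 4: Link Disable */

/* Set Link Disable while keeping every other Link Control bit as it was. */
static uint16_t lnkctl_disable_link(uint16_t cur)
{
    return (uint16_t)(cur | LNKCTL_LINK_DISABLE);
}

/* Clear Link Disable again so the link can retrain. */
static uint16_t lnkctl_enable_link(uint16_t cur)
{
    return (uint16_t)(cur & ~LNKCTL_LINK_DISABLE);
}
```

With setpci, the same masked update can be expressed with the value:mask form,
e.g. CAP_EXP+10.w=0010:0010 to set the bit and CAP_EXP+10.w=0000:0010 to clear
it.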

 +-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
 |           |                               +-04.0-[3e]----
 |           |                               \-08.0-[3f]----00.0  Mellanox Technologies MT27800 Family [ConnectX-5]

 # lspci -vvv -s 0000:3f:00.0
 3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
 ...
         Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
 ...

1) A PCI config-space read request returns 0xFFFFFFFF.
 # lspci -vvv -s 0000:3f:00.0
 3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
         !!! Unknown header type 7f
         Kernel driver in use: mlx5_core
         Kernel modules: mlx5_core
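
The "rev ff", "prog-if ff" and "header type 7f" fields in that output are what
one gets when every config-space byte reads back as 0xFF. The sketch below
decodes the two relevant locations (the dword at offset 08h and the byte at
offset 0Eh); the struct and function names are illustrative only.

```c
#include <stdint.h>

/* Illustrative subset of the decoded config header. */
struct cfg_id {
    uint8_t revision;      /* offset 08h */
    uint8_t prog_if;       /* offset 09h */
    uint8_t header_type;   /* offset 0Eh, bits 6:0 (header layout) */
    int     multifunction; /* offset 0Eh, bit 7 */
};

static struct cfg_id decode_cfg(uint32_t class_rev_dword, uint8_t header_type_byte)
{
    struct cfg_id id;

    id.revision = class_rev_dword & 0xFF;
    id.prog_if = (class_rev_dword >> 8) & 0xFF;
    id.header_type = header_type_byte & 0x7F;
    id.multifunction = (header_type_byte >> 7) & 1;
    return id;
}
```

Feeding it all-ones input, decode_cfg(0xFFFFFFFF, 0xFF), gives revision 0xff,
prog-if 0xff and header type 0x7f, matching the lspci output.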

2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.
 # ./devmem
 Usage: ./devmem <phys_addr> <size> <offset> [value]
   phys_addr : physical base address of the BAR (hex or decimal)
   size      : mapping length in bytes (hex or decimal)
   offset    : register offset from BAR base (hex or decimal)
   value     : optional 32-bit value to write (hex or decimal)
 Example: ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF
 # ./devmem 0x3af804000000 0x2000000 0x0
 0x3af804000000 = 0xffffffff

 Before the link was disabled, we could read 0x3af804000000 with devmem and
 obtain a valid result.
 # ./devmem 0x3af804000000 0x2000000 0x0
 0x3af804000000 = 0x10002300
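
 The devmem tool itself is not included in this mail. A rough sketch of how
 such a BAR read can be done from userspace is shown below, assuming a
 page-aligned physical address and a kernel that allows mapping the range
 through /dev/mem (CONFIG_STRICT_DEVMEM and similar restrictions can prevent
 it). bar_read32() is a hypothetical helper, not the tool used for the
 results above.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `size` bytes at `phys_addr` from `path` (normally "/dev/mem") and read
 * one 32-bit register at `offset`. Returns 0 on success, -1 on any failure.
 */
static int bar_read32(const char *path, uint64_t phys_addr, size_t size,
                      size_t offset, uint32_t *out)
{
    void *map;
    int fd;

    /* The register must be naturally aligned and inside the mapping. */
    if (offset % 4 != 0 || size < 4 || offset > size - 4)
        return -1;

    fd = open(path, O_RDONLY | O_SYNC);
    if (fd < 0)
        return -1;

    map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, (off_t)phys_addr);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    /* volatile: force a real load, which becomes a memory-read TLP. */
    *out = *(volatile uint32_t *)((uint8_t *)map + offset);

    munmap(map, size);
    return 0;
}
```

 A caller would use it as bar_read32("/dev/mem", 0x3af804000000, 0x2000000,
 0x0, &val) and print val; an all-ones result is then a hint that the
 endpoint did not answer.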

Besides, after searching the kernel code, I found that many endpoint drivers
already check whether their device is still present. Some PCIe endpoint drivers
may still have gaps here; see, for example, commit 43bb40c5b926 ("virtio_pci:
Support surprise removal of virtio pci device").

Best Regards,
Jinhui
