Message-ID: <c40b5e6cb26654f698e51b131956065b952ad222.camel@decadent.org.uk>
Date: Sat, 12 Jul 2025 17:12:30 +0200
From: Ben Hutchings <ben@...adent.org.uk>
To: intel-wired-lan@...ts.osuosl.org, linux-pci <linux-pci@...r.kernel.org>,
Pavan Chebbi <pavan.chebbi@...adcom.com>, Michael Chan <mchan@...adcom.com>
Cc: Laurent Bonnaud <L.Bonnaud@...oste.net>, 1104670@...s.debian.org,
netdev@...r.kernel.org
Subject: Re: Bug#1104670: linux-image-6.12.25-amd64: system does not shut
down - GHES: Fatal hardware error
Hi all,
On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
[...]
> - Previously the kernel would output an error in /var/lib/systemd/pstore/ but would shutdown anyway.
>
> - Now, with kernel 6.1.135-1, the shutdown is blocked as with 6.12.x kernels (see below).
> --
> Laurent.
>
> <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> <6>[ 961.098743] kvm: exiting hardware virtualization
> <6>[ 961.361878] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> <0>[ 963.828220] {1}[Hardware Error]: device_id: 0000:01:00.1
> <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> <0>[ 963.828224] {1}[Hardware Error]: class_code: 020000
> <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00018000
> <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 40000001 0000000f 90028090 00000000
[...]
It seems that this is a known bug in the BIOS of several Dell PowerEdge
models including (in this case) the R540.
A workaround was added to the tg3 driver
<https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
and a similar change was proposed (but not accepted) in the i40e driver
<https://lore.kernel.org/all/20241227035459.90602-1-yue.zhao@shopee.com/>.
On this system the error log points to a device handled by the ixgbe
driver, and no workaround has been implemented for that.
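
As an aside, the aer_uncor_status and aer_uncor_mask values in the quoted
log can be decoded by hand. The following is only a minimal sketch: the bit
positions are assumed to follow the PCI_ERR_UNC_* definitions in the kernel's
pci_regs.h, and the function name decode_aer_uncor is made up for
illustration. It does not interpret anything beyond which bits are set.

```python
# Sketch: name the bits set in a PCIe AER uncorrectable error register.
# Bit positions are assumed to match the PCI_ERR_UNC_* masks in pci_regs.h.
AER_UNCOR_BITS = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncor(value):
    """Return the names of the known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


# Values copied from the quoted log.
print(decode_aer_uncor(0x00100000))  # aer_uncor_status
print(decode_aer_uncor(0x00018000))  # aer_uncor_mask
```

With that table, the status value from the log has only the Unsupported
Request bit set, and the mask covers Completer Abort and Unexpected
Completion.
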
Since this issue seems to affect multiple different NIC vendors and
drivers, would it make more sense to implement this workaround as a PCI
quirk?
Ben.
--
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed
- Carolyn Scheppner