Message-ID: <aYZU09qCN3u-_byj@wunner.de>
Date: Fri, 6 Feb 2026 21:53:39 +0100
From: Lukas Wunner <lukas@...ner.de>
To: Keith Busch <kbusch@...nel.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, Breno Leitao <leitao@...ian.org>,
	Jonathan Corbet <corbet@....net>,
	Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
	Oliver O'Halloran <oohall@...il.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
	linux-pci@...r.kernel.org, dcostantino@...a.com, rneu@...a.com,
	kernel-team@...a.com
Subject: Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > Are there any other similar flags you already use that we could
> > piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> > the existing "panic_on_warn" would be enough?
> 
> There are many KERN_WARNING messages that don't rise to the level of
> warranting a 'panic', so we don't want to enable such an option in
> production. It looks like panic_on_warn was introduced for developer
> debugging.

panic_on_warn springs into action on WARN() splats, not arbitrary
messages with KERN_WARNING severity.  Also, sysctl kernel.warn_limit
may be used to grant a certain number of panic-free WARNs.
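
To illustrate how the two knobs interact, here is a rough user-space
model in Python (not kernel code).  It assumes the documented sysctl
semantics: panic_on_warn panics on the first WARN(), and otherwise a
nonzero warn_limit panics once the WARN() count reaches the limit.
The function and argument names are made up for this sketch.

```python
def warn_would_panic(warn_count: int, panic_on_warn: bool, warn_limit: int) -> bool:
    """Model whether the WARN() that brings the total to warn_count panics.

    warn_count includes the current WARN().  Messages that are merely
    printed at KERN_WARNING severity never reach this check at all.
    """
    if panic_on_warn:
        # panic_on_warn takes effect on the very first WARN() splat.
        return True
    # warn_limit == 0 means "no limit"; otherwise panic once it is reached.
    return warn_limit != 0 and warn_count >= warn_limit
```

With warn_limit set to 3 and panic_on_warn unset, the first two WARNs
are panic-free and the third one panics.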

FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
by this patch feels somewhat oddly named.  Something like
"pci.panic_on_fatal" might be clearer and more succinct.

> I agree the current INFO level is too low for the generic unrecovered
> condition, though.

At least for unbound devices, I think 918b4053184c went way too far.
I think an unbound device should generally be considered recoverable
through a reset.

As for bound devices whose drivers lack pci_error_handlers, it has been
painful in practice that they're considered unrecoverable wholesale.
E.g. GPUs often expose an audio device as well as telemetry devices,
all arranged below an integrated PCIe switch.  All of these devices
need drivers with pci_error_handlers in order for the GPU to be
recoverable.  In some cases, dummy callbacks were added to render
the whole thing recoverable.
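
The all-or-nothing rule can be sketched as a small Python model (again
user-space, not kernel code).  The Device class, the field names and
the unbound_ok switch are hypothetical.  unbound_ok=False assumes that
an unbound device currently vetoes recovery, as implied above;
unbound_ok=True models treating unbound devices as recoverable through
a reset.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    bound: bool                     # a driver is bound to the device
    has_err_handlers: bool = False  # that driver provides pci_error_handlers


def recoverable(devices, unbound_ok=False):
    """Return True only if no device below the switch vetoes recovery."""
    for dev in devices:
        if not dev.bound:
            if not unbound_ok:
                return False        # unbound device vetoes (assumed current behavior)
        elif not dev.has_err_handlers:
            return False            # bound driver without pci_error_handlers vetoes
    return True


# Hypothetical GPU with functions below an integrated PCIe switch.
gpu = [
    Device("gpu", bound=True, has_err_handlers=True),
    Device("audio", bound=True, has_err_handlers=False),
    Device("telemetry", bound=False),
]
```

In this model the GPU stays unrecoverable until the audio driver gains
handlers (even dummy ones) and the telemetry device is either bound to
a driver with handlers or tolerated via unbound_ok=True.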

So I wouldn't consider 918b4053184c to have been a universally successful
approach and I fear that this patch goes even further.

Thanks,

Lukas
