lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241231111314.GDZ3PRyq_tiU002p5d@fat_crate.local>
Date: Tue, 31 Dec 2024 12:13:14 +0100
From: Borislav Petkov <bp@...en8.de>
To: Feng Tang <feng.tang@...ux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@...ux.alibaba.com>, rafael@...nel.org,
	Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>,
	Tony Luck <tony.luck@...el.com>, linux-acpi@...r.kernel.org,
	linux-kernel@...r.kernel.org, Ira Weiny <ira.weiny@...el.com>,
	Dave Jiang <dave.jiang@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Andi Kleen <ak@...ux.intel.com>
Subject: Re: [PATCH] acpi/ghes: Make ghes_panic_timeout adjustable as a
 parameter

On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
> Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
> One thing is, most ARCHs' default timeout is 0, while in our case, the user
> will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
> value or the OS default one. Originally I even thought about adding a flag
> of 'timeout_user_changed'.  Any suggestion?

Ok, enough talking. Let's get concrete:

From: "Borislav Petkov (AMD)" <bp@...en8.de>
Date: Tue, 31 Dec 2024 12:03:55 +0100
Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting

The GHES driver overrides the panic= setting by rebooting the system
after a fatal hw error has been reported. The intent being that such an
error would be hopefully written out faster on non-volatile storage for
later inspection.

However, this is not optimal when a hard-to-debug issue requires long
time to reproduce and when that happens, the box will get rebooted after
30 seconds and thus destroy the whole hw context of when the error
happened.

So rip out the default GHES panic timeout and honor the global one.

In the panic disabled (panic=0) case, the error will still be logged to
dmesg for later inspection and if panic after a hw error is really
required, then that can be controlled the usual way - use panic= on the
cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.

Reported-by: Feng Tang <feng.tang@...ux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@...en8.de>
---
 drivers/acpi/apei/ghes.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..b72772494655 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
-static int ghes_panic_timeout __read_mostly = 30;
-
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
 			 struct acpi_hest_generic_status *estatus,
 			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
 {
+	const char *msg = GHES_PFX "Fatal hardware error";
+
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
 
 	ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
 
-	/* reboot to log the error! */
 	if (!panic_timeout)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+		pr_emerg("%s but panic disabled\n", msg);
+
+	panic(msg);
 }
 
 static int ghes_proc(struct ghes *ghes)
-- 
2.43.0


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ