Message-ID: <aQwDd-Nhgxpkdrcb@agluck-desk3>
Date: Wed, 5 Nov 2025 18:09:59 -0800
From: "Luck, Tony" <tony.luck@...el.com>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
CC: "Rafael J. Wysocki" <rafael@...nel.org>, Borislav Petkov <bp@...en8.de>,
	Hanjun Guo <guohanjun@...wei.com>, Mauro Carvalho Chehab
	<mchehab@...nel.org>, <linux-acpi@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>, Andi Kleen
	<andi.kleen@...el.com>
Subject: Re: [PATCH] ACPI: APEI: GHES: Improve ghes_notify_nmi() status check

On Thu, Nov 06, 2025 at 09:46:33AM +0800, Shuai Xue wrote:
> 
> 
> On 2025/11/4 07:05, Tony Luck wrote:
> > ghes_notify_nmi() is called for every NMI and must check whether the NMI was
> > generated because an error was signalled by platform firmware.
> > 
> > This check is very expensive: for each registered GHES NMI source it reads
> > from the ACPI generic address attached to that error source to get the physical
> > address of the acpi_hest_generic_status block, then checks the "block_status"
> > field to see whether an error was logged.
> > 
> > The ACPI/APEI code must create virtual mappings for each of those physical
> > addresses, and tear them down afterwards. On an Icelake system this takes
> > around 15,000 TSC cycles, enough to disturb efforts to profile system
> > performance.
> 
> Hi, Tony
> 
> Interesting.
> 
> If I understand correctly, you mean ghes_peek_estatus() and
> ghes_clear_estatus().
> 
> I conducted performance testing on our system (ARM v8) and found the
> following average costs:
> 
> - ghes_peek_estatus(): 8,138.3 ns (21,160 cycles)
> - ghes_clear_estatus(): 2,038.3 ns (5,300 cycles)

ARM doesn't use the NMI path (HAVE_ACPI_APEI_NMI is only set on X86).
But maybe you are looking at ghes_notify_sea(), which seems similar?
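
To spell out where the cycles go on the x86 NMI path: the per-source
check today boils down to roughly the following (simplified; the real
code goes through ghes_peek_estatus() and an NMI-safe fixmap mapping
via ghes_map()/ghes_unmap(), for which the plain ioremap() below is
just a stand-in):

	u64 buf_paddr;
	void __iomem *vaddr;

	/* Read the ACPI generic address to find the status block's
	 * physical address. */
	apei_read(&buf_paddr, &ghes->generic->error_status_address);

	/* Map it, check block_status, tear the mapping down again.
	 * The map/unmap round trip on every NMI (including the TLB
	 * maintenance) is the expensive part. */
	vaddr = ioremap(buf_paddr, sizeof(u32));
	if (readl(vaddr)) {
		/* error logged: read and process the full estatus block */
	}
	iounmap(vaddr);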
> 
> > 
> > If that were not bad enough, there are some atomic accesses in the code path
> > that will cause cache line bounces between CPUs, a problem that gets worse as
> > the core count increases.
> 
> Could you elaborate on which specific atomic accesses you're referring to?

ghes_notify_nmi() starts with code to ensure only one CPU executes the
GHES NMI path.

	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
		return ret;

Looks like an optimization to avoid having a bunch of CPUs queue up
waiting for this spinlock:

	raw_spin_lock(&ghes_notify_lock_nmi);

since only the first CPU to take it will find and handle the logged error.
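
For reference, the overall shape of the handler is roughly this
(paraphrased from ghes_notify_nmi() in drivers/acpi/apei/ghes.c,
with details trimmed):

	static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
	{
		static DEFINE_RAW_SPINLOCK(ghes_notify_lock_nmi);
		struct ghes *ghes;
		int ret = NMI_DONE;

		/* Atomic RMW on one shared counter: every CPU's NMI
		 * bounces this cache line, even when no error is logged. */
		if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
			return ret;

		raw_spin_lock(&ghes_notify_lock_nmi);
		list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
			/* peek estatus, handle and clear any logged error */
		}
		raw_spin_unlock(&ghes_notify_lock_nmi);

		atomic_dec(&ghes_in_nmi);
		return ret;
	}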
> 
> > 
> > But BIOS changes neither the acpi generic address nor the physical address of
> > the acpi_hest_generic_status block. So this walk can be done once when the NMI is
> > registered to save the virtual address (unmapping if the NMI is ever unregistered).
> > The "block_status" can be checked directly in the NMI handler. This can be done
> > without any atomic accesses.
> > 
> > The resulting time to check that there is no error record is around 900 cycles.
> > 
> > Reported-by: Andi Kleen <andi.kleen@...el.com>
> > Signed-off-by: Tony Luck <tony.luck@...el.com>
> > 
> > ---
> > N.B. I only talked to an Intel BIOS expert about this. GHES code is shared by
> > other architectures, so it would be wise to get confirmation on whether this
> > assumption applies to all, or is Intel (or X86) specific.
> 
> The assumption is "BIOS changes neither the acpi generic address nor the
> physical address of the acpi_hest_generic_status block"?
> 
> I've consulted with our BIOS experts from both ARM and RISC-V platform
> teams, and they confirmed that error status blocks are reserved at boot
> time and remain unchanged during runtime.

Thanks. Good to have this confirmation.

> > ---
> >   include/acpi/ghes.h      |  1 +
> >   drivers/acpi/apei/ghes.c | 39 ++++++++++++++++++++++++++++++++++++---
> >   2 files changed, 37 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> > index ebd21b05fe6e..58655d313a1f 100644
> > --- a/include/acpi/ghes.h
> > +++ b/include/acpi/ghes.h
> > @@ -29,6 +29,7 @@ struct ghes {
> >   	};
> >   	struct device *dev;
> >   	struct list_head elist;
> > +	void __iomem *error_status_vaddr;
> >   };
> >   struct ghes_estatus_node {
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 97ee19f2cae0..62713b612865 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -1425,7 +1425,21 @@ static LIST_HEAD(ghes_nmi);
> >   static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> >   {
> >   	static DEFINE_RAW_SPINLOCK(ghes_notify_lock_nmi);
> > +	bool active_error = false;
> >   	int ret = NMI_DONE;
> > +	struct ghes *ghes;
> > +
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
> > +		if (ghes->error_status_vaddr && readl(ghes->error_status_vaddr)) {
> > +			active_error = true;
> > +			break;
> > +		}
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	if (!active_error)
> > +		return ret;
> 
> Should we put active_error into struct ghes? If we know it is active, we
> do not need to call __ghes_peek_estatus() to check estatus->block_status.

That might be a useful addition. I was primarily concerned with fixing the
"no error" case, which happens at a very high rate while profiling the
system with "perf". But skipping (or even removing) __ghes_peek_estatus()
when you have already confirmed that there is a logged error would be good.
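
Something along these lines, perhaps (hypothetical sketch on top of this
patch; "nmi_active" is a made-up field name, and the break is dropped so
that every active source gets flagged):

	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
		if (ghes->error_status_vaddr && readl(ghes->error_status_vaddr)) {
			/* Hypothetical new field: lets the handler skip
			 * __ghes_peek_estatus() for this source later. */
			ghes->nmi_active = true;
			active_error = true;
		}
	}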

If you can use the same technique for ghes_notify_sea(), then it would be
sensible to move the code I added in ghes_nmi_add() to ghes_new(), so that
the virtual address is saved for every type of GHES notification.
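
For the record, the registration-time piece could look something like this
(illustrative sketch, not the actual hunk from this patch; the function
name is made up, and error handling plus the matching iounmap() on
unregistration are omitted):

	static void ghes_save_error_status_vaddr(struct ghes *ghes)
	{
		u64 buf_paddr;

		/* The GAS and the status block's physical address are
		 * fixed after boot, so read them once here. */
		if (apei_read(&buf_paddr, &ghes->generic->error_status_address))
			return;
		if (!buf_paddr)
			return;

		/* Only block_status (the first u32 of the status block)
		 * is needed for the cheap NMI-time check. */
		ghes->error_status_vaddr = ioremap(buf_paddr, sizeof(u32));
	}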
> 
> Thanks.
> Shuai

-Tony
