Message-ID: <aQzi83dxBold4Rmy@agluck-desk3>
Date: Thu, 6 Nov 2025 10:03:31 -0800
From: "Luck, Tony" <tony.luck@...el.com>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
CC: "Rafael J. Wysocki" <rafael@...nel.org>, Borislav Petkov <bp@...en8.de>,
	Hanjun Guo <guohanjun@...wei.com>, Mauro Carvalho Chehab
	<mchehab@...nel.org>, <linux-acpi@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>, Andi Kleen
	<andi.kleen@...el.com>
Subject: Re: [PATCH] ACPI: APEI: GHES: Improve ghes_notify_nmi() status check

On Thu, Nov 06, 2025 at 01:04:05PM +0800, Shuai Xue wrote:
> 
> 
> > On 2025/11/6 10:09, Luck, Tony wrote:
> > On Thu, Nov 06, 2025 at 09:46:33AM +0800, Shuai Xue wrote:
> > > 
> > > 
> > > On 2025/11/4 07:05, Tony Luck wrote:
> > > > ghes_notify_nmi() is called for every NMI and must check whether the NMI was
> > > > generated because an error was signalled by platform firmware.
> > > > 
> > > > This check is very expensive: for each registered GHES NMI source it reads
> > > > the ACPI generic address attached to that error source to get the physical
> > > > address of the acpi_hest_generic_status block, then checks the "block_status"
> > > > field to see whether an error was logged.
> > > > 
> > > > The ACPI/APEI code must create virtual mappings for each of those physical
> > > > addresses and tear them down afterwards. On an Icelake system this takes around
> > > > 15,000 TSC cycles, enough to disturb efforts to profile system performance.
> > > 
> > > Hi, Tony
> > > 
> > > Interesting.
> > > 
> > > If I understand correctly, you mean ghes_peek_estatus() and
> > > ghes_clear_estatus().
> > > 
> > > I conducted performance testing on our system (ARM v8) and found the
> > > following average costs:
> > > 
> > > - ghes_peek_estatus(): 8,138.3 ns (21,160 cycles)
> > > - ghes_clear_estatus(): 2,038.3 ns (5,300 cycles)
> > 
> > ARM doesn't use the NMI path (HAVE_ACPI_APEI_NMI is only set on X86).
> > But maybe you are looking at ghes_notify_sea() which seems similar?
> 
> Yes. It is measured in ghes_notify_sea().
> 
> > > 
> > > > 
> > > > As if that were not bad enough, there are some atomic accesses in the code path
> > > > that will cause cache line bounces between CPUs, a problem that gets worse as
> > > > the core count increases.
> > > 
> > > Could you elaborate on which specific atomic accesses you're referring to?
> > 
> > ghes_notify_nmi() starts with code to ensure only one CPU executes the
> > GHES NMI path.
> > 
> > 	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > 		return ret;
> > 
> > Looks like an optimization to avoid having a bunch of CPUs queue up
> > waiting for this spinlock:
> > 
> > 	raw_spin_lock(&ghes_notify_lock_nmi);
> > 
> > when the first one to get it will find and handle the logged error.
> 
> If an NMI is issued, at least one status block is active. I don't see how
> the code path is different.

On X86 NMI doesn't include a vector. There is just one interrupt entry
point for any NMI.

Linux deals with this by having each subsystem that may expect an NMI
register on an NMI notify chain. When an NMI happens, the core
code calls each subsystem's handler to ask:

	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?
	Was this NMI for you?

This is a problem when profiling the system with "perf", as there are
thousands of NMIs. Each one calls ghes_notify_nmi(), which looks to see
whether an error was logged and (after all the ACPI/APEI mapping of
physical addresses, reading, and unmapping) returns NMI_DONE to
indicate that no error was found, so the NMI came from some other
source.
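
For illustration, a subsystem hooks the chain roughly like this. This is
only a sketch of the NMI_DONE/NMI_HANDLED contract, not the actual ghes.c
code, and my_device_raised_nmi() is a made-up helper:

	#include <linux/init.h>
	#include <asm/nmi.h>

	static int my_nmi_handler(unsigned int cmd, struct pt_regs *regs)
	{
		/* Cheap check: did our hardware raise this NMI? */
		if (!my_device_raised_nmi())	/* hypothetical helper */
			return NMI_DONE;	/* "not my interrupt" */

		/* ... handle the event ... */
		return NMI_HANDLED;
	}

	static int __init my_nmi_init(void)
	{
		/* ghes_nmi_add() does the equivalent of this for "ghes" */
		return register_nmi_handler(NMI_LOCAL, my_nmi_handler, 0, "my_nmi");
	}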
> 
> > > 
> > > > 
> > > > But the BIOS changes neither the ACPI generic address nor the physical address of
> > > > the acpi_hest_generic_status block. So this walk can be done once, when the NMI handler
> > > > is registered, to save the virtual address (unmapping it if the handler is ever unregistered).
> > > > The "block_status" can then be checked directly in the NMI handler, without any
> > > > atomic accesses.
> > > > 
> > > > The resulting time to check that there is no error record is around 900 cycles.
> > > > 
> > > > Reported-by: Andi Kleen <andi.kleen@...el.com>
> > > > Signed-off-by: Tony Luck <tony.luck@...el.com>
> > > > 
> > > > ---
> > > > N.B. I only talked to an Intel BIOS expert about this. GHES code is shared by
> > > > other architectures, so it would be wise to get confirmation on whether this
> > > > assumption applies to all, or is Intel (or X86) specific.
> > > 
> > > Is the assumption that "the BIOS changes neither the ACPI generic address nor the
> > > physical address of the acpi_hest_generic_status block"?
> > > 
> > > I've consulted with our BIOS experts from both ARM and RISC-V platform
> > > teams, and they confirmed that error status blocks are reserved at boot
> > > time and remain unchanged during runtime.
> > 
> > Thanks. Good to have this confirmation.
> > 
> > > > ---
> > > >    include/acpi/ghes.h      |  1 +
> > > >    drivers/acpi/apei/ghes.c | 39 ++++++++++++++++++++++++++++++++++++---
> > > >    2 files changed, 37 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> > > > index ebd21b05fe6e..58655d313a1f 100644
> > > > --- a/include/acpi/ghes.h
> > > > +++ b/include/acpi/ghes.h
> > > > @@ -29,6 +29,7 @@ struct ghes {
> > > >    	};
> > > >    	struct device *dev;
> > > >    	struct list_head elist;
> > > > +	void __iomem *error_status_vaddr;
> > > >    };
> > > >    struct ghes_estatus_node {
> > > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > > index 97ee19f2cae0..62713b612865 100644
> > > > --- a/drivers/acpi/apei/ghes.c
> > > > +++ b/drivers/acpi/apei/ghes.c
> > > > @@ -1425,7 +1425,21 @@ static LIST_HEAD(ghes_nmi);
> > > >    static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> > > >    {
> > > >    	static DEFINE_RAW_SPINLOCK(ghes_notify_lock_nmi);
> > > > +	bool active_error = false;
> > > >    	int ret = NMI_DONE;
> > > > +	struct ghes *ghes;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
> > > > +		if (ghes->error_status_vaddr && readl(ghes->error_status_vaddr)) {
> > > > +			active_error = true;
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	if (!active_error)
> > > > +		return ret;
> > > 
> > > Should we put active_error into struct ghes? If we know it is active, we
> > > do not need to call __ghes_peek_estatus() to read estatus->block_status.
> > 
> > That might be a useful addition. I was primarily concerned with fixing the
> > "no error" case that happens at a very high rate while profiling the
> > system with "perf".
> 
> Do you mean you see "NMI received for unknown reason" when profiling with
> "perf"? And is the unknown error handled by ghes_notify_nmi()?

This message is printed when every handler on the NMI chain says "not
my interrupt" with the NMI_DONE return value.
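
In outline (simplified, not verbatim arch/x86/kernel/nmi.c, and nmi_chain
here just stands in for the real per-type handler list) the core does
something like:

	static void nmi_chain_walk(unsigned int cmd, struct pt_regs *regs)
	{
		struct nmiaction *a;
		int handled = 0;

		/* Ask every registered handler: "Was this NMI for you?" */
		list_for_each_entry(a, &nmi_chain, list)
			handled += a->handler(cmd, regs);	/* NMI_DONE is 0 */

		/* Nobody claimed it, so warn about an NMI from an unknown source */
		if (!handled)
			pr_emerg("NMI received for unknown reason ...\n");
	}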
> 
> I see some unknown NMIs in production on Intel platforms, but I have not
> figured out how they happen. Can you help to explain it?
> 
> > But skipping (or just removing?) __ghes_peek_estatus() if you have
> > already confirmed that there is a logged error would be good.
> > 
> > If you can use the same technique for ghes_notify_sea() then it would be
> > sensible to move the code I added to ghes_nmi_add() to ghes_new() to
> > save the virtual address for every type of GHES notification.
> 
> Sure, I'd like to add it for ghes_notify_sea().

Great. Take my patch and fix it up to cover ghes_notify_sea() as well. I
don't have a system that I can test that on. When you are done I can
retest ghes_notify_nmi() to check that it still works.
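
For reference, the registration-time part in ghes_new() could look
something like the untested sketch below. It reuses the error_status_vaddr
field from my patch; ghes_map_error_status() is just an illustrative name,
and error handling plus the matching unmap on removal are omitted:

	static int ghes_map_error_status(struct ghes *ghes)
	{
		u64 paddr;
		int rc;

		/*
		 * Firmware fixes the physical address of the
		 * acpi_hest_generic_status block at boot, so read it once.
		 */
		rc = apei_read(&paddr, &ghes->generic->error_status_address);
		if (rc)
			return rc;

		/* Keep the mapping so notify paths can just readl() block_status */
		ghes->error_status_vaddr = acpi_os_map_iomem(paddr, sizeof(u32));
		return ghes->error_status_vaddr ? 0 : -ENOMEM;
	}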
> 
> Thanks.
> Shuai

-Tony
> 
