linux-kernel - Re: [PATCH] ACPI: APEI: GHES: Improve ghes_notify

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cd918c98-ef71-4f2e-a02f-6e003ca03ea3@linux.alibaba.com>
Date: Fri, 7 Nov 2025 10:20:05 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Borislav Petkov <bp@...en8.de>,
 Hanjun Guo <guohanjun@...wei.com>, Mauro Carvalho Chehab
 <mchehab@...nel.org>, linux-acpi@...r.kernel.org,
 linux-kernel@...r.kernel.org, patches@...ts.linux.dev,
 Andi Kleen <andi.kleen@...el.com>
Subject: Re: [PATCH] ACPI: APEI: GHES: Improve ghes_notify_nmi() status check



在 2025/11/7 02:03, Luck, Tony 写道:
> On Thu, Nov 06, 2025 at 01:04:05PM +0800, Shuai Xue wrote:
>>
>>
>> 在 2025/11/6 10:09, Luck, Tony 写道:
>>> On Thu, Nov 06, 2025 at 09:46:33AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2025/11/4 07:05, Tony Luck 写道:
>>>>> ghes_notify_nmi() is called for every NMI and must check whether the NMI was
>>>>> generated because an error was signalled by platform firmware.
>>>>>
>>>>> This check is very expensive as for each registered GHES NMI source it reads
>>>>> from the acpi generic address attached to this error source to get the physical
>>>>> address of the acpi_hest_generic_status block.  It then checks the "block_status"
>>>>> to see if an error was logged.
>>>>>
>>>>> The ACPI/APEI code must create virtual mappings for each of those physical
>>>>> addresses, and tear them down afterwards. On an Icelake system this takes around
>>>>> 15,000 TSC cycles. Enough to disturb efforts to profile system performance.
>>>>
>>>> Hi, Tony
>>>>
>>>> Interesting.
>>>>
>>>> If I understand correctly, you mean ghes_peek_estatus() and
>>>> ghes_clear_estatus().
>>>>
>>>> I conducted performance testing on our system (ARM v8) and found the
>>>> following average costs:
>>>>
>>>> - ghes_peek_estatus(): 8,138.3 ns (21,160 cycles)
>>>> - ghes_clear_estatus(): 2,038.3 ns (5,300 cycles)
>>>
>>> ARM doesn't use the NMI path (HAVE_ACPI_APEI_NMI is only set on X86).
>>> But maybe you are looking at ghes_notify_sea() which seems similar?
>>
>> Yes. It is measured in ghes_notify_sea().
>>
>>>>
>>>>>
>>>>> If that were not bad enough, there are some atomic accesses in the code path
>>>>> that will cause cache line bounces between CPUs. A problem that gets worse as
>>>>> the core count increases.
>>>>
>>>> Could you elaborate on which specific atomic accesses you're referring to?
>>>
>>> ghes_notify_nmi() starts with code to ensure only one CPU executes the
>>> GHES NMI path.
>>>
>>> 	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>>> 		return ret;
>>>
>>> Looks like an optimization to avoid having a bunch of CPUs queue up
>>> waiting for this spinllock:
>>>
>>> 	raw_spin_lock(&ghes_notify_lock_nmi);
>>>
>>> when the first one to get it will find and handle the logged error.
>>
>> If an NMI issued, at last one status block is active. I don't see how
>> the code path is different.
> 
> On X86 NMI doesn't include a vector. There is just one interrupt entry
> point for any NMI.
> 
> Linux deals with this with each subsystem that may expect an NMI
> registering on an NMI notify chain. When an NMI happens the core
> code calls each subsystem function to ask:
> 
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 	Was this NMI for you?
> 
> This is a problem when profiling the system with "perf" as there are
> thousands of NMIs, and each one calls ghes_notify_nmi() which will look
> to see if an error was logged, and (after all the ACPI/APEI mapping
> physical addresses, reading, and unmapping) will return NMI_DONE to
> indicate that no error was found, so this NMI came from some other
> source.

I see, thank you the datails explaination.

>>
>>>>
>>>>>
>>>>> But BIOS changes neither the acpi generic address nor the physical address of
>>>>> the acpi_hest_generic_status block. So this walk can be done once when the NMI is
>>>>> registered to save the virtual address (unmapping if the NMI is ever unregistered).
>>>>> The "block_status" can be checked directly in the NMI handler. This can be done
>>>>> without any atomic accesses.
>>>>>
>>>>> Resulting time to check that there is not an error record is around 900 cycles.
>>>>>
>>>>> Reported-by: Andi Kleen <andi.kleen@...el.com>
>>>>> Signed-off-by: Tony Luck <tony.luck@...el.com>
>>>>>
>>>>> ---
>>>>> N.B. I only talked to an Intel BIOS expert about this. GHES code is shared by
>>>>> other architectures, so it would be wise to get confirmation on whether this
>>>>> assumption applies to all, or is Intel (or X86) specific.
>>>>
>>>> The assumption is "BIOS changes neither the acpi generic address nor the
>>>> physical address of the acpi_hest_generic_status block."?
>>>>
>>>> I've consulted with our BIOS experts from both ARM and RISC-V platform
>>>> teams, and they confirmed that error status blocks are reserved at boot
>>>> time and remain unchanged during runtime.
>>>
>>> Thanks. Good to have this confirmation.
>>>
>>>>> ---
>>>>>     include/acpi/ghes.h      |  1 +
>>>>>     drivers/acpi/apei/ghes.c | 39 ++++++++++++++++++++++++++++++++++++---
>>>>>     2 files changed, 37 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>>>>> index ebd21b05fe6e..58655d313a1f 100644
>>>>> --- a/include/acpi/ghes.h
>>>>> +++ b/include/acpi/ghes.h
>>>>> @@ -29,6 +29,7 @@ struct ghes {
>>>>>     	};
>>>>>     	struct device *dev;
>>>>>     	struct list_head elist;
>>>>> +	void __iomem *error_status_vaddr;
>>>>>     };
>>>>>     struct ghes_estatus_node {
>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>> index 97ee19f2cae0..62713b612865 100644
>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>> @@ -1425,7 +1425,21 @@ static LIST_HEAD(ghes_nmi);
>>>>>     static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>>>>>     {
>>>>>     	static DEFINE_RAW_SPINLOCK(ghes_notify_lock_nmi);
>>>>> +	bool active_error = false;
>>>>>     	int ret = NMI_DONE;
>>>>> +	struct ghes *ghes;
>>>>> +
>>>>> +	rcu_read_lock();
>>>>> +	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
>>>>> +		if (ghes->error_status_vaddr && readl(ghes->error_status_vaddr)) {
>>>>> +			active_error = true;
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +	rcu_read_unlock();
>>>>> +
>>>>> +	if (!active_error)
>>>>> +		return ret;
>>>>
>>>> Shoud we put active_error into struct ghes? If we know it is active, we
>>>> do not need to call __ghes_peek_estatus() to estatus->block_status().
>>>
>>> That might be a useful addition. I was primarily concerned in fixing the
>>> "no erroor" case that happens at a very high rate while profiling the
>>> system with "perf".
>>
>> Do you mean you see "NMI received for unknown reason" when profiling with
>> "perf"? And the unknown error is handled by ghes_notify_nmi().
> 
> This message is printed when every handler on the NMI chain says "not
> my interrupt" with the NMI_DONE return value.

$sudo perf script
swapper     0 [148] 1117148.086594: probe:ghes_notify_nmi: (ffffffffb3867210)
         ffffffffb3867211 ghes_notify_nmi+0x1 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb303c366 nmi_handle.part.0+0x56 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3dc7e0a default_do_nmi+0x11a (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3dc7f40 exc_nmi+0x100 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3e0169a end_repeat_nmi+0x16 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3732915 _find_next_and_bit+0x65 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3177780 update_sg_lb_stats+0xc0 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb317d5dc update_sd_lb_stats.constprop.0+0xcc (/lib/modules/6.6.102+/build/vmlinu>
         ffffffffb317d7d1 find_busiest_group+0x41 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb317dc0a load_balance+0x12a (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb317efd4 rebalance_domains+0x2b4 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb31235d2 handle_softirqs+0xd2 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb31238b3 __irq_exit_rcu+0xa3 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3dca462 sysvec_apic_timer_interrupt+0x72 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3e00e86 asm_sysvec_apic_timer_interrupt+0x16 (/lib/modules/6.6.102+/build/vmlin>
         ffffffffb3dcbbee cpuidle_enter_state+0xbe (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3adc6b9 cpuidle_enter+0x29 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb31901ea cpuidle_idle_call+0xfa (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb31902db do_idle+0x7b (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb3190536 cpu_startup_entry+0x26 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb30775d5 start_secondary+0x115 (/lib/modules/6.6.102+/build/vmlinux)
         ffffffffb30001e7 secondary_startup_64_no_verify+0x182 (/lib/modules/6.6.102+/build/vmlin>

Yes, I even find that ghes_notify_nmi() is called when no perf is used! (:

>>
>> I see some unknown NMI in production in Intel platform, but I did not
>> figure out how it happend. Can you help to explain it?
>>
>>> But skipping (or just removing?
>>> __ghes_peek_estatus()) if you have already confirmed that there is
>>> a logged error would be good.
>>>
>>> If you can use the same technique for ghes_notify_sea() then it would be
>>> sensible to move the code I added to ghes_nmi_add() to ghes_new() to
>>> save the virtual address for every type of GHES notification.
>>
>> Sure, I'd like to add it for ghes_notify_sea().
> 
> Great. Take my patch and fix it up to cover ghes_notify_sea() as well. I
> don't have a system that I can test that on. When you are done I can
> retest the ghes_notify_nmi() to check that it still works.

Got it.

Thanks.
Shuai