linux-kernel - Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold limit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6e2d5351-dbd2-4e28-8bdb-b961fade5ebc@suse.com>
Date: Fri, 19 Sep 2025 13:42:57 +0300
From: Nikolay Borisov <nik.borisov@...e.com>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: x86@...nel.org, Tony Luck <tony.luck@...el.com>,
 "Rafael J. Wysocki" <rafael@...nel.org>, linux-kernel@...r.kernel.org,
 linux-edac@...r.kernel.org, Smita.KoralahalliChannabasappa@....com,
 Qiuxu Zhuo <qiuxu.zhuo@...el.com>, linux-acpi@...r.kernel.org
Subject: Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold
 limit



On 9/15/25 20:33, Yazen Ghannam wrote:
> On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 9/8/25 18:40, Yazen Ghannam wrote:
>>> The MCA threshold limit generally is not something that needs to change
>>> during runtime. It is common for a system administrator to decide on a
>>> policy for their managed systems.
>>>
>>> If MCA thresholding is OS-managed, then the threshold limit must be set
>>> at every boot. However, many systems allow the user to set a value in
>>> their BIOS. And this is reported through an APEI HEST entry even if
>>> thresholding is not in FW-First mode.
>>>
>>> Use this value, if available, to set the OS-managed threshold limit.
>>> Users can still override it through sysfs if desired for testing or
>>> debug.
>>>
>>> APEI is parsed after MCE is initialized. So reset the thresholding
>>> blocks later to pick up the threshold limit.
>>>
>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
>>> ---
>>>
>>> Notes:
>>>       Link:
>>>       https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
>>>       v5->v6:
>>>       * No change.
>>>       v4->v5:
>>>       * No change.
>>>       v3->v4:
>>>       * New in v4.
>>>
>>>    arch/x86/include/asm/mce.h          |  6 ++++++
>>>    arch/x86/kernel/acpi/apei.c         |  2 ++
>>>    arch/x86/kernel/cpu/mce/amd.c       | 18 ++++++++++++++++--
>>>    arch/x86/kernel/cpu/mce/internal.h  |  2 ++
>>>    arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
>>>    5 files changed, 39 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
>>> index 7d6588195d56..1cfbfff0be3f 100644
>>> --- a/arch/x86/include/asm/mce.h
>>> +++ b/arch/x86/include/asm/mce.h
>>> @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
>>>    /* Disable CMCI/polling for MCA bank claimed by firmware */
>>>    extern void mce_disable_bank(int bank);
>>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>>> +void mce_save_apei_thr_limit(u32 thr_limit);
>>> +#else
>>> +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
>>> +#endif /* CONFIG_X86_MCE_THRESHOLD */
>>> +
>>>    /*
>>>     * Exception handler
>>>     */
>>> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
>>> index 0916f00a992e..e21419e686eb 100644
>>> --- a/arch/x86/kernel/acpi/apei.c
>>> +++ b/arch/x86/kernel/acpi/apei.c
>>> @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
>>>    	if (!cmc->enabled)
>>>    		return 0;
>>> +	mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
>>> +
>>>    	/*
>>>    	 * We expect HEST to provide a list of MC banks that report errors
>>>    	 * in firmware first mode. Otherwise, return non-zero value to
>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>> index b895559e80ad..9b746080351f 100644
>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>> @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
>>>    	}
>>>    }
>>> +/* Try to use the threshold limit reported through APEI. */
>>> +static u16 get_thr_limit(void)
>>> +{
>>> +	u32 thr_limit = mce_get_apei_thr_limit();
>>> +
>>> +	/* Fallback to old default if APEI limit is not available. */
>>> +	if (!thr_limit)
>>> +		return THRESHOLD_MAX;
>>> +
>>> +	return min(thr_limit, THRESHOLD_MAX);
>>> +}
>>> +
>>>    static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>>    {
>>>    	struct thresh_restart tr = {
>>> @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>>    		.lvt_off		= offset,
>>>    	};
>>> -	b->threshold_limit		= THRESHOLD_MAX;
>>> +	b->threshold_limit		= get_thr_limit();
>>>    	threshold_restart_block(&tr);
>>>    };
>>> @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>>    	b->address		= address;
>>>    	b->interrupt_enable	= 0;
>>>    	b->interrupt_capable	= lvt_interrupt_supported(bank, high);
>>> -	b->threshold_limit	= THRESHOLD_MAX;
>>> +	b->threshold_limit	= get_thr_limit();
>>>    	if (b->interrupt_capable) {
>>>    		default_attrs[2] = &interrupt_enable.attr;
>>> @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>>    	list_add(&b->miscj, &tb->miscj);
>>> +	mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
>>
>> Why is this necessary? Shouldn't this patch consist of mainly
>> s/THRESHOLD_MAX/get_thr_limit();
>>
>>
>> In allocate_threshold_block have already properly set threshold_limit. So
>> this change really ensures threshold_restart_block is being called for the
>> given block being initialized. Ignoring the changed threshold limit logic,
>> why is this extra call necessary now and wasn't before?
>>
> 
> It is necessary to apply the threshold limit to the hardware register.
> The MCA thresholding registers are accessed in two passes: first during
> per-CPU init, and second when the MCE subsystem devices are created.
> 
> The hardware registers are updated in the first pass, and they are left
> as-is in the second pass assuming no configuration has changed. That's
> why there isn't a "reset" in the second pass.
> 
> The APEI tables are parsed between the first and second passes. So now
> we need to update the registers during the second pass to apply the
> value found from HEST.

So APEI is initialized as part of the subsys_initcall which is processed 
via:

start_kernel
rest_init
kernel_init
kernel_init_freeable
do_basic_setup
do_initcalls

And the first mce_threshold_block_init() happens from :

start_kernel
arch_cpu_finalize_init <---- way before rest_init()
identify_boot_cpu
identify_cpu
mcheck_cpu_init
mcheck_cpu_init_vendor
mce_amd_feature_init
prepare_threshold_block
mce_threshold_block_init


Finally the per-cpu hotplug callback is installed via:

mcheck_init_device <- initiated from a device_initcall, happens after 
APEI subsys init.
mce_cpu_online  - called on every cpu from the HP machinery
mce_threshold_create_device
threshold_create_bank
allocate_threshold_blocks
mce_threshold_block_init - newly added call in alloc block, used to 
update the register with the new limit


Given that mce_cpu_online is already called on every online cpu at the 
time of installation of the callback, and every subsequent cpu that will 
come online I can't help but wonder why do we have to do the mce 
initialization from start_kernel, can't we move the mcheck_cpu_init call 
into mce_cpu_online, or have the latter subsume the former?

Sorry if I'm being too nitpicky, I just want to have proper 
understanding of the subsystem and the various (implicit) requirements 
it has.


At the very least I believe this commit message should at least allude 
to the fact that mce threshold devices have a strict requirement to be 
created after the APEI subsys has been created.


> 
> Thanks,
> Yazen