[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6e2d5351-dbd2-4e28-8bdb-b961fade5ebc@suse.com>
Date: Fri, 19 Sep 2025 13:42:57 +0300
From: Nikolay Borisov <nik.borisov@...e.com>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: x86@...nel.org, Tony Luck <tony.luck@...el.com>,
"Rafael J. Wysocki" <rafael@...nel.org>, linux-kernel@...r.kernel.org,
linux-edac@...r.kernel.org, Smita.KoralahalliChannabasappa@....com,
Qiuxu Zhuo <qiuxu.zhuo@...el.com>, linux-acpi@...r.kernel.org
Subject: Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold
limit
On 9/15/25 20:33, Yazen Ghannam wrote:
> On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 9/8/25 18:40, Yazen Ghannam wrote:
>>> The MCA threshold limit generally is not something that needs to change
>>> during runtime. It is common for a system administrator to decide on a
>>> policy for their managed systems.
>>>
>>> If MCA thresholding is OS-managed, then the threshold limit must be set
>>> at every boot. However, many systems allow the user to set a value in
>>> their BIOS. And this is reported through an APEI HEST entry even if
>>> thresholding is not in FW-First mode.
>>>
>>> Use this value, if available, to set the OS-managed threshold limit.
>>> Users can still override it through sysfs if desired for testing or
>>> debug.
>>>
>>> APEI is parsed after MCE is initialized. So reset the thresholding
>>> blocks later to pick up the threshold limit.
>>>
>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
>>> ---
>>>
>>> Notes:
>>> Link:
>>> https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
>>> v5->v6:
>>> * No change.
>>> v4->v5:
>>> * No change.
>>> v3->v4:
>>> * New in v4.
>>>
>>> arch/x86/include/asm/mce.h | 6 ++++++
>>> arch/x86/kernel/acpi/apei.c | 2 ++
>>> arch/x86/kernel/cpu/mce/amd.c | 18 ++++++++++++++++--
>>> arch/x86/kernel/cpu/mce/internal.h | 2 ++
>>> arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
>>> 5 files changed, 39 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
>>> index 7d6588195d56..1cfbfff0be3f 100644
>>> --- a/arch/x86/include/asm/mce.h
>>> +++ b/arch/x86/include/asm/mce.h
>>> @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
>>> /* Disable CMCI/polling for MCA bank claimed by firmware */
>>> extern void mce_disable_bank(int bank);
>>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>>> +void mce_save_apei_thr_limit(u32 thr_limit);
>>> +#else
>>> +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
>>> +#endif /* CONFIG_X86_MCE_THRESHOLD */
>>> +
>>> /*
>>> * Exception handler
>>> */
>>> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
>>> index 0916f00a992e..e21419e686eb 100644
>>> --- a/arch/x86/kernel/acpi/apei.c
>>> +++ b/arch/x86/kernel/acpi/apei.c
>>> @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
>>> if (!cmc->enabled)
>>> return 0;
>>> + mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
>>> +
>>> /*
>>> * We expect HEST to provide a list of MC banks that report errors
>>> * in firmware first mode. Otherwise, return non-zero value to
>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>> index b895559e80ad..9b746080351f 100644
>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>> @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
>>> }
>>> }
>>> +/* Try to use the threshold limit reported through APEI. */
>>> +static u16 get_thr_limit(void)
>>> +{
>>> + u32 thr_limit = mce_get_apei_thr_limit();
>>> +
>>> + /* Fallback to old default if APEI limit is not available. */
>>> + if (!thr_limit)
>>> + return THRESHOLD_MAX;
>>> +
>>> + return min(thr_limit, THRESHOLD_MAX);
>>> +}
>>> +
>>> static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>> {
>>> struct thresh_restart tr = {
>>> @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
>>> .lvt_off = offset,
>>> };
>>> - b->threshold_limit = THRESHOLD_MAX;
>>> + b->threshold_limit = get_thr_limit();
>>> threshold_restart_block(&tr);
>>> };
>>> @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>> b->address = address;
>>> b->interrupt_enable = 0;
>>> b->interrupt_capable = lvt_interrupt_supported(bank, high);
>>> - b->threshold_limit = THRESHOLD_MAX;
>>> + b->threshold_limit = get_thr_limit();
>>> if (b->interrupt_capable) {
>>> default_attrs[2] = &interrupt_enable.attr;
>>> @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
>>> list_add(&b->miscj, &tb->miscj);
>>> + mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
>>
>> Why is this necessary? Shouldn't this patch consist of mainly
>> s/THRESHOLD_MAX/get_thr_limit();
>>
>>
>> In allocate_threshold_block have already properly set threshold_limit. So
>> this change really ensures threshold_restart_block is being called for the
>> given block being initialized. Ignoring the changed threshold limit logic,
>> why is this extra call necessary now and wasn't before?
>>
>
> It is necessary to apply the threshold limit to the hardware register.
> The MCA thresholding registers are accessed in two passes: first during
> per-CPU init, and second when the MCE subsystem devices are created.
>
> The hardware registers are updated in the first pass, and they are left
> as-is in the second pass assuming no configuration has changed. That's
> why there isn't a "reset" in the second pass.
>
> The APEI tables are parsed between the first and second passes. So now
> we need to update the registers during the second pass to apply the
> value found from HEST.
So APEI is initialized as part of the subsys_initcall which is processed
via:
start_kernel
rest_init
kernel_init
kernel_init_freeable
do_basic_setup
do_initcalls
And the first mce_threshold_block_init() happens from :
start_kernel
arch_cpu_finalize_init <---- way before rest_init()
identify_boot_cpu
identify_cpu
mcheck_cpu_init
mcheck_cpu_init_vendor
mce_amd_feature_init
prepare_threshold_block
mce_threshold_block_init
Finally the per-cpu hotplug callback is installed via:
mcheck_init_device <- initiated from a device_initcall, happens after
APEI subsys init.
mce_cpu_online - called on every cpu from the HP machinery
mce_threshold_create_device
threshold_create_bank
allocate_threshold_blocks
mce_threshold_block_init - newly added call in alloc block, used to
update the register with the new limit
Given that mce_cpu_online is already called on every online cpu at the
time of installation of the callback, and every subsequent cpu that will
come online I can't help but wonder why do we have to do the mce
initialization from start_kernel, can't we move the mcheck_cpu_init call
into mce_cpu_online, or have the latter subsume the former?
Sorry if I'm being too nitpicky, I just want to have proper
understanding of the subsystem and the various (implicit) requirements
it has.
At the very least I believe this commit message should at least allude
to the fact that mce threshold devices have a strict requirement to be
created after the APEI subsys has been created.
>
> Thanks,
> Yazen
Powered by blists - more mailing lists