[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250922135824.GA31279@yaz-khff2.amd.com>
Date: Mon, 22 Sep 2025 09:58:24 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Nikolay Borisov <nik.borisov@...e.com>
Cc: x86@...nel.org, Tony Luck <tony.luck@...el.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
Smita.KoralahalliChannabasappa@....com,
Qiuxu Zhuo <qiuxu.zhuo@...el.com>, linux-acpi@...r.kernel.org
Subject: Re: [PATCH v6 15/15] x86/mce: Save and use APEI corrected threshold
limit
On Fri, Sep 19, 2025 at 01:42:57PM +0300, Nikolay Borisov wrote:
>
>
> On 9/15/25 20:33, Yazen Ghannam wrote:
> > On Thu, Sep 11, 2025 at 08:01:17PM +0300, Nikolay Borisov wrote:
> > >
> > >
> > > On 9/8/25 18:40, Yazen Ghannam wrote:
> > > > The MCA threshold limit generally is not something that needs to change
> > > > during runtime. It is common for a system administrator to decide on a
> > > > policy for their managed systems.
> > > >
> > > > If MCA thresholding is OS-managed, then the threshold limit must be set
> > > > at every boot. However, many systems allow the user to set a value in
> > > > their BIOS. And this is reported through an APEI HEST entry even if
> > > > thresholding is not in FW-First mode.
> > > >
> > > > Use this value, if available, to set the OS-managed threshold limit.
> > > > Users can still override it through sysfs if desired for testing or
> > > > debug.
> > > >
> > > > APEI is parsed after MCE is initialized. So reset the thresholding
> > > > blocks later to pick up the threshold limit.
> > > >
> > > > Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
> > > > ---
> > > >
> > > > Notes:
> > > > Link:
> > > > https://lore.kernel.org/r/20250825-wip-mca-updates-v5-20-865768a2eef8@amd.com
> > > > v5->v6:
> > > > * No change.
> > > > v4->v5:
> > > > * No change.
> > > > v3->v4:
> > > > * New in v4.
> > > >
> > > > arch/x86/include/asm/mce.h | 6 ++++++
> > > > arch/x86/kernel/acpi/apei.c | 2 ++
> > > > arch/x86/kernel/cpu/mce/amd.c | 18 ++++++++++++++++--
> > > > arch/x86/kernel/cpu/mce/internal.h | 2 ++
> > > > arch/x86/kernel/cpu/mce/threshold.c | 13 +++++++++++++
> > > > 5 files changed, 39 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > > > index 7d6588195d56..1cfbfff0be3f 100644
> > > > --- a/arch/x86/include/asm/mce.h
> > > > +++ b/arch/x86/include/asm/mce.h
> > > > @@ -308,6 +308,12 @@ DECLARE_PER_CPU(struct mce, injectm);
> > > > /* Disable CMCI/polling for MCA bank claimed by firmware */
> > > > extern void mce_disable_bank(int bank);
> > > > +#ifdef CONFIG_X86_MCE_THRESHOLD
> > > > +void mce_save_apei_thr_limit(u32 thr_limit);
> > > > +#else
> > > > +static inline void mce_save_apei_thr_limit(u32 thr_limit) { }
> > > > +#endif /* CONFIG_X86_MCE_THRESHOLD */
> > > > +
> > > > /*
> > > > * Exception handler
> > > > */
> > > > diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> > > > index 0916f00a992e..e21419e686eb 100644
> > > > --- a/arch/x86/kernel/acpi/apei.c
> > > > +++ b/arch/x86/kernel/acpi/apei.c
> > > > @@ -19,6 +19,8 @@ int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
> > > > if (!cmc->enabled)
> > > > return 0;
> > > > + mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
> > > > +
> > > > /*
> > > > * We expect HEST to provide a list of MC banks that report errors
> > > > * in firmware first mode. Otherwise, return non-zero value to
> > > > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> > > > index b895559e80ad..9b746080351f 100644
> > > > --- a/arch/x86/kernel/cpu/mce/amd.c
> > > > +++ b/arch/x86/kernel/cpu/mce/amd.c
> > > > @@ -489,6 +489,18 @@ static void threshold_restart_bank(unsigned int bank, bool intr_en)
> > > > }
> > > > }
> > > > +/* Try to use the threshold limit reported through APEI. */
> > > > +static u16 get_thr_limit(void)
> > > > +{
> > > > + u32 thr_limit = mce_get_apei_thr_limit();
> > > > +
> > > > + /* Fallback to old default if APEI limit is not available. */
> > > > + if (!thr_limit)
> > > > + return THRESHOLD_MAX;
> > > > +
> > > > + return min(thr_limit, THRESHOLD_MAX);
> > > > +}
> > > > +
> > > > static void mce_threshold_block_init(struct threshold_block *b, int offset)
> > > > {
> > > > struct thresh_restart tr = {
> > > > @@ -497,7 +509,7 @@ static void mce_threshold_block_init(struct threshold_block *b, int offset)
> > > > .lvt_off = offset,
> > > > };
> > > > - b->threshold_limit = THRESHOLD_MAX;
> > > > + b->threshold_limit = get_thr_limit();
> > > > threshold_restart_block(&tr);
> > > > };
> > > > @@ -1071,7 +1083,7 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> > > > b->address = address;
> > > > b->interrupt_enable = 0;
> > > > b->interrupt_capable = lvt_interrupt_supported(bank, high);
> > > > - b->threshold_limit = THRESHOLD_MAX;
> > > > + b->threshold_limit = get_thr_limit();
> > > > if (b->interrupt_capable) {
> > > > default_attrs[2] = &interrupt_enable.attr;
> > > > @@ -1082,6 +1094,8 @@ static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb
> > > > list_add(&b->miscj, &tb->miscj);
> > > > + mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
> > >
> > > Why is this necessary? Shouldn't this patch consist of mainly
> > > s/THRESHOLD_MAX/get_thr_limit();
> > >
> > >
> > > In allocate_threshold_block have already properly set threshold_limit. So
> > > this change really ensures threshold_restart_block is being called for the
> > > given block being initialized. Ignoring the changed threshold limit logic,
> > > why is this extra call necessary now and wasn't before?
> > >
> >
> > It is necessary to apply the threshold limit to the hardware register.
> > The MCA thresholding registers are accessed in two passes: first during
> > per-CPU init, and second when the MCE subsystem devices are created.
> >
> > The hardware registers are updated in the first pass, and they are left
> > as-is in the second pass assuming no configuration has changed. That's
> > why there isn't a "reset" in the second pass.
> >
> > The APEI tables are parsed between the first and second passes. So now
> > we need to update the registers during the second pass to apply the
> > value found from HEST.
>
> So APEI is initialized as part of the subsys_initcall which is processed
> via:
>
> start_kernel
> rest_init
> kernel_init
> kernel_init_freeable
> do_basic_setup
> do_initcalls
>
> And the first mce_threshold_block_init() happens from :
>
> start_kernel
> arch_cpu_finalize_init <---- way before rest_init()
> identify_boot_cpu
> identify_cpu
> mcheck_cpu_init
> mcheck_cpu_init_vendor
> mce_amd_feature_init
> prepare_threshold_block
> mce_threshold_block_init
>
>
> Finally the per-cpu hotplug callback is installed via:
>
> mcheck_init_device <- initiated from a device_initcall, happens after APEI
> subsys init.
> mce_cpu_online - called on every cpu from the HP machinery
> mce_threshold_create_device
> threshold_create_bank
> allocate_threshold_blocks
> mce_threshold_block_init - newly added call in alloc block, used to update
> the register with the new limit
>
>
> Given that mce_cpu_online is already called on every online cpu at the time
> of installation of the callback, and every subsequent cpu that will come
> online I can't help but wonder why do we have to do the mce initialization
> from start_kernel, can't we move the mcheck_cpu_init call into
> mce_cpu_online, or have the latter subsume the former?
>
> Sorry if I'm being too nitpicky, I just want to have proper understanding of
> the subsystem and the various (implicit) requirements it has.
>
No worries. It is a bit convoluted. I'd like to clean this up
eventually. I have an old attempt here:
https://github.com/AMDESE/linux/commit/640db92eca07804e889fac88856904552a4466bd
In general, I'd like to do the hardware and data structure init in a
single pass. And the sysfs interface can be optionally added separately.
>
> At the very least I believe this commit message should at least allude to
> the fact that mce threshold devices have a strict requirement to be created
> after the APEI subsys has been created.
>
>
The commit message has this:
"APEI is parsed after MCE is initialized. So reset the thresholding
blocks later to pick up the threshold limit."
The "devices" are already created after APEI subsys init. The missing
step is updating the hardware registers.
Thanks,
Yazen
Powered by blists - more mailing lists