Message-ID: <20260113213158.GUaWa5zunSfuAzra0n@fat_crate.local>
Date: Tue, 13 Jan 2026 22:31:58 +0100
From: Borislav Petkov <bp@...en8.de>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: "Li, Rongqing" <lirongqing@...du.com>,
Nikolay Borisov <nik.borisov@...e.com>,
Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>,
"H . Peter Anvin" <hpa@...or.com>,
Yazen Ghannam <yazen.ghannam@....com>,
"Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>,
Avadhut Naik <avadhut.naik@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: Reply: Reply: Reply: [External Mail] Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a MCE event
On Tue, Jan 13, 2026 at 09:05:01PM +0000, Luck, Tony wrote:
> >> $ dmesg | grep 'Machine Check Event:'
> >
> > Did you see the "Machine check events logged\n" print from mce_notify_irq() in
> > dmesg too?
>
> Yes. I used the other grep pattern to see detail of which CPU/bank logged the error.
> Same pattern of timestamps shows up with this grep too.
Yah, this confirms the flow:
mce_timer_fn() -> ... -> machine_check_poll() -> mce_log(), which queues the
work and returns.
Now, back in mce_timer_fn:
	/*
	 * Alert userspace if needed. If we logged an MCE, reduce the polling
	 * interval, otherwise increase the polling interval.
	 */
	if (mce_notify_irq())

<--- we haven't run the notifier chain yet, so mce_need_notify is not set yet,
so this won't hit and we won't halve the interval. I need to verify that
empirically.

		iv = max(iv / 2, (unsigned long) HZ/100);
	else
		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
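To make the arithmetic concrete, here is a small userspace sketch of that interval update. It is not the kernel code: SKETCH_HZ, next_interval() and the helper names are made up for illustration, HZ is assumed to be 1000, and the round_jiffies_relative() rounding is left out.

```c
#include <stdbool.h>

/* Assumption for illustration only: the real HZ depends on the kernel config. */
#define SKETCH_HZ 1000UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the interval update at the end of mce_timer_fn().
 * 'notified' stands in for the return value of mce_notify_irq() and
 * 'check_interval_secs' for check_interval. Unlike the kernel code, the
 * upper bound is not passed through round_jiffies_relative().
 */
static unsigned long next_interval(unsigned long iv, bool notified,
				   unsigned long check_interval_secs)
{
	if (notified)
		/* Poll faster, but never below HZ/100 jiffies. */
		return max_ul(iv / 2, SKETCH_HZ / 100);

	/* Back off, but never beyond the configured check interval. */
	return min_ul(iv * 2, check_interval_secs * SKETCH_HZ);
}
```

With these assumed values and a 300 second check interval, an interval of 300000 jiffies halves to 150000 when a notification was seen, and stays at 300000 when none was seen because it is already at the cap.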
And now the notifier chain runs: mce_early_notifier() sets the bit and calls
mce_notify_irq(), which clears the bit again, and then a little later in the
notifier chain skx_edac logs the error.
So this looks like a silly timing issue...
We could set mce_need_notify in mce_log(), zap this thing:
	if (__ratelimit(&ratelimit))
		pr_info(HW_ERR "Machine check events logged\n");
in mce_notify_irq(), or at least predicate it on the CEC being enabled, and then
not call mce_notify_irq() in the notifier but leave it to be called in the timer
function...
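A toy userspace model of the suspected ordering, and of the first idea above, might look like the sketch below. Everything in it is an assumption for illustration: the struct and the model_* functions are invented names that only mirror the kernel ones, the deferred work is modelled as always running after the timer's check, and none of this has been verified against the real code paths.

```c
#include <stdbool.h>

/* Toy state; not the kernel's data structures. */
struct mce_model {
	bool need_notify;	/* stands in for mce_need_notify */
	bool work_pending;	/* stands in for the work queued by mce_log() */
};

/* Model of mce_log(): queue the work, optionally set the flag right away. */
static void model_mce_log(struct mce_model *m, bool set_flag_early)
{
	m->work_pending = true;
	if (set_flag_early)
		m->need_notify = true;
}

/* Model of mce_notify_irq(): test and clear the flag. */
static bool model_mce_notify_irq(struct mce_model *m)
{
	bool was_set = m->need_notify;

	m->need_notify = false;
	return was_set;
}

/*
 * Model of the deferred notifier chain. In the current flow the early
 * notifier sets the flag and consumes it itself; in the proposed flow
 * the notifier leaves the flag alone.
 */
static void model_run_notifier_chain(struct mce_model *m, bool notify_here)
{
	if (!m->work_pending)
		return;

	m->work_pending = false;
	if (notify_here) {
		m->need_notify = true;
		model_mce_notify_irq(m);
	}
}

/*
 * One timer tick that logs an error. Returns true if the timer's own
 * mce_notify_irq() check saw the flag, i.e. the interval would be halved.
 */
static bool model_timer_tick(struct mce_model *m, bool proposed)
{
	bool halve;

	model_mce_log(m, proposed);
	halve = model_mce_notify_irq(m);	/* assumed to run before the work */
	model_run_notifier_chain(m, !proposed);
	return halve;
}
```

In this model, model_timer_tick(&m, false) returns false because the flag is set and cleared entirely inside the deferred work, while model_timer_tick(&m, true) returns true because the flag is already set when the timer checks it.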
Ufff, how silly and overengineered we've made it. I need to think about
a cleaner solution tomorrow...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette