Message-ID: <aYobX83_0kElO3NZ@agluck-desk3>
Date: Mon, 9 Feb 2026 09:37:35 -0800
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
CC: "Li, Rongqing" <lirongqing@...du.com>, Nikolay Borisov
	<nik.borisov@...e.com>, Thomas Gleixner <tglx@...nel.org>, Ingo Molnar
	<mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
	"x86@...nel.org" <x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, "Yazen
 Ghannam" <yazen.ghannam@....com>, "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>,
	Avadhut Naik <avadhut.naik@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-edac@...r.kernel.org"
	<linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a
 MCE event

On Sat, Feb 07, 2026 at 12:51:42PM +0100, Borislav Petkov wrote:
> On Wed, Jan 14, 2026 at 02:50:34PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 13, 2026 at 04:30:08PM -0800, Luck, Tony wrote:
> > > Seems to work (though you've deleted all the places where mce_need_notify
> > > is used, so you can also delete the declaration).
> > 
> > Right.
> > 
> > > I see time delta between logs reducing while I'm injecting errors.
> > > 
> > > When I pause injection for several minutes, and then restart I see the
> > > interval went back up again.
> > 
> > Thanks Tony, I'll play with this too and ponder over what would be the proper
> > fix which to take to stable too.
> 
> Hmm, so looking at this more while it is all peaceful and I can actually hear
> the thoughts in my head... :-)
> 
> The whole dance here on the MCE logging path:
> 
> mce_log -> ... mce_irq_work -> ... mce_work -> mce_gen_pool_process
> 
> can happen in between two mce_timer_fn() function firings - just think of
> the default timer running once every 5 mins.
> 
> So in-between those runs with 5 min timeout, errors can get logged and when
> mce_notify_irq() runs, it won't see either that the genpool is not empty
> - it will be empty - and mce_need_notify will be 0 too because we would've
> set and cleared it. 

The algorithm to halve the interval when errors are found, and double it
when they are not found, was originally written for a "poll-only"
configuration, so there was no way for an error to be logged between timer
invocations. This all dates back to before #MC was recoverable.
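
For anyone following along, the halve/double rule amounts to something
like the sketch below. This is illustrative user-space C, not the kernel
code: next_interval() and the POLL_* names are made up for the example,
the 5 minute ceiling is the default mentioned in this thread, and the
10 ms floor is just an assumed value.

```c
#include <stdbool.h>

/* Ceiling: the default 5 minute poll interval mentioned in this thread. */
#define POLL_MAX_MS (5UL * 60 * 1000)
/* Floor: an assumed value for illustration only. */
#define POLL_MIN_MS 10UL

/*
 * Hypothetical helper: given the current interval and whether this poll
 * found errors, return the interval to use for the next poll.
 */
static unsigned long next_interval(unsigned long iv_ms, bool found_errors)
{
	if (found_errors) {
		/* Errors seen: poll twice as often, down to the floor. */
		iv_ms /= 2;
		return iv_ms < POLL_MIN_MS ? POLL_MIN_MS : iv_ms;
	}

	/* Quiet poll: back off, up to the ceiling. */
	iv_ms *= 2;
	return iv_ms > POLL_MAX_MS ? POLL_MAX_MS : iv_ms;
}
```

Starting from the ceiling, three polls in a row that find errors take the
interval to 150000, 75000 and 37500 ms; quiet polls then double it back
up until it reaches the ceiling again.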

If the system is now running in some mixed mode of polling and
interrupts, then it is unclear what should be done in various
new cases.

> 
> So basically, the timer fires, we log errors without it noticing anything, and
> it won't halve.
> 
> The only way it would halve is if it manages to notice an error being
> in-flight to being logged and it fires right then and there. Then its interval
> would get halved.
> 
> And this sounds kinda weird and not what we want perhaps.
> 
> So fixing that would mean, we'd have to write down the fact that in-between
> two timer invocations, we have logged errors. Maybe a per-CPU counter
> somewhere which says "this CPU logged so many errors after the timer ran the
> last time".
> 
> The timer would fire, check that counter for != 0, and if so, decrease
> interval and clear it.
> 
> And it doesn't even have to be a counter - it suffices to be a single bit
> which gets set.
> 
> A scheme like that would solve this accurately I'd say.
> 
> But the real question actually is, do we really care?

I don't think we care. If we miss halving the interval because an
error was logged between timer-based polls, nothing really bad will
happen. The interval might get sorted out on the next poll.
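
Just so it is written down, the single-bit scheme you describe could
look roughly like the sketch below. It is user-space C with C11 atomics
standing in for a per-CPU variable. Every name in it (poll_state,
note_error_logged(), timer_poll()) is hypothetical, and the 10 ms floor
is an assumed value. It is not a proposed patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define POLL_MAX_MS (5UL * 60 * 1000)	/* 5 min default from this thread */
#define POLL_MIN_MS 10UL		/* assumed floor, illustration only */

/* Stand-in for per-CPU state; the kernel would use a per-CPU variable. */
struct poll_state {
	atomic_bool logged_since_poll;	/* set when an error is logged */
	unsigned long interval_ms;	/* current poll interval */
};

/* Hypothetical hook called from the logging path. */
static void note_error_logged(struct poll_state *ps)
{
	atomic_store(&ps->logged_since_poll, true);
}

/*
 * Hypothetical timer body: test-and-clear the flag, then halve the
 * interval if an error was logged since the last run or was found by
 * this poll, and double it otherwise.
 */
static unsigned long timer_poll(struct poll_state *ps, bool found_now)
{
	bool seen = atomic_exchange(&ps->logged_since_poll, false);
	unsigned long iv = ps->interval_ms;

	if (seen || found_now) {
		iv /= 2;
		if (iv < POLL_MIN_MS)
			iv = POLL_MIN_MS;
	} else {
		iv *= 2;
		if (iv > POLL_MAX_MS)
			iv = POLL_MAX_MS;
	}

	ps->interval_ms = iv;
	return iv;
}
```

The exchange reads and clears the flag in one step, so an error logged
while the timer is running is either seen by this run or left set for
the next one.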

> I mean, this thing went unnoticed for so long and frankly, people should run
> the CEC anyway which has a better MCE-has-been-logged stifling capability so
> that I wanna say, let's do the simplest thing and be done with it.
> 
> Or?
> 
> Do we care about some real use case here...?
> 

Unless someone has a real world case where something is going badly
wrong, then I don't think any changes are needed to cover this race.

-Tony
