Message-ID: <20260207115142.GBaYcnTp7maUDVv3Nc@fat_crate.local>
Date: Sat, 7 Feb 2026 12:51:42 +0100
From: Borislav Petkov <bp@...en8.de>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: "Li, Rongqing" <lirongqing@...du.com>,
	Nikolay Borisov <nik.borisov@...e.com>,
	Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"x86@...nel.org" <x86@...nel.org>,
	"H . Peter Anvin" <hpa@...or.com>,
	Yazen Ghannam <yazen.ghannam@....com>,
	"Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>,
	Avadhut Naik <avadhut.naik@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a
 MCE event

On Wed, Jan 14, 2026 at 02:50:34PM +0100, Borislav Petkov wrote:
> On Tue, Jan 13, 2026 at 04:30:08PM -0800, Luck, Tony wrote:
> > Seems to work (though you've deleted all the places where mce_need_notify
> > is used, so you can also delete the declaration).
> 
> Right.
> 
> > I see time delta between logs reducing while I'm injecting errors.
> > 
> > When I pause injection for several minutes, and then restart I see the
> > interval went back up again.
> 
> Thanks Tony, I'll play with this too and ponder over what would be the proper
> fix to take to stable too.

Hmm, so looking at this more while it is all peaceful and I can actually hear
the thoughts in my head... :-)

The whole dance here on the MCE logging path:

mce_log -> ... mce_irq_work -> ... mce_work -> mce_gen_pool_process

can happen in between two mce_timer_fn() function firings - just think of
the default timer running once every 5 mins.

So in between those runs with the 5 min timeout, errors can get logged and
when mce_notify_irq() runs from the timer, it won't see a non-empty genpool
- it will have been emptied already - and mce_need_notify will be 0 too
because we would've set and cleared it by then.

So basically, we log errors, the timer fires without noticing anything, and
its interval won't get halved.

The only way the interval would get halved is if the timer happens to fire
right then and there, while an error is still in flight to being logged.
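
To make that window concrete, here is a rough userspace model of the
behaviour described above. It is only a sketch: struct mce_model and the
model_*() helpers are made-up names, not the kernel's code, and the timer
here only models the halving, not the back-off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state, not the kernel's data structures. */
struct mce_model {
	size_t pool_pending;     /* records queued but not yet processed */
	bool need_notify;        /* set while a notification is outstanding */
	unsigned long interval;  /* timer interval, in seconds */
};

/* Logging path, first half: queue a record and flag it. */
static void model_log_error(struct mce_model *m)
{
	assert(m);
	m->pool_pending++;
	m->need_notify = true;
}

/* Logging path, second half: the deferred work drains everything. */
static void model_process(struct mce_model *m)
{
	assert(m);
	m->pool_pending = 0;
	m->need_notify = false;
}

/*
 * Timer: it can only react to what is in flight at the moment it fires.
 * Returns true if the interval was halved.
 */
static bool model_timer_fire(struct mce_model *m)
{
	assert(m);
	if (m->pool_pending == 0 && !m->need_notify)
		return false;	/* nothing visible, interval is not halved */

	m->interval /= 2;
	return true;
}
```

If model_log_error() and model_process() both run before
model_timer_fire(), the timer sees nothing and the interval stays put. It
only gets halved when the timer fires between the two.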

And this sounds kinda weird and not what we want perhaps.

So fixing that would mean, we'd have to write down the fact that in-between
two timer invocations, we have logged errors. Maybe a per-CPU counter
somewhere which says "this CPU logged so many errors after the timer ran the
last time".

The timer would fire, check that counter for != 0, and if so, decrease
interval and clear it.

And it doesn't even have to be a counter - it suffices to be a single bit
which gets set.

A scheme like that would solve this accurately I'd say.
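
Something along these lines, as a userspace sketch only. All names, the CPU
count, the 1 second floor and the doubling back-off are assumptions made
for illustration; only the 5 min default and the halving come from the
discussion above. In the kernel this would be a per-CPU variable instead of
an array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS      4      /* hypothetical CPU count */
#define MODEL_MIN_INTERVAL 1UL    /* hypothetical floor, in seconds */
#define MODEL_MAX_INTERVAL 300UL  /* the 5 min default, in seconds */

/* One "logged since the last timer run" bit per CPU. */
static atomic_bool logged_since_timer[MODEL_NR_CPUS];
static unsigned long timer_interval[MODEL_NR_CPUS];

static void model_init(void)
{
	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		atomic_store(&logged_since_timer[cpu], false);
		timer_interval[cpu] = MODEL_MAX_INTERVAL;
	}
}

/* Logging path: just remember that something got logged on this CPU. */
static void model_note_logged(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	atomic_store(&logged_since_timer[cpu], true);
}

/* Timer: test and clear the bit, then adjust and return the interval. */
static unsigned long model_timer_fn(unsigned int cpu)
{
	unsigned long iv;

	assert(cpu < MODEL_NR_CPUS);
	iv = timer_interval[cpu];

	/* The exchange reads and clears in one step so a racing set is not lost. */
	if (atomic_exchange(&logged_since_timer[cpu], false))
		iv = iv / 2 > MODEL_MIN_INTERVAL ? iv / 2 : MODEL_MIN_INTERVAL;
	else
		iv = iv * 2 < MODEL_MAX_INTERVAL ? iv * 2 : MODEL_MAX_INTERVAL;

	timer_interval[cpu] = iv;
	return iv;
}
```

With this, it no longer matters whether the logging path has finished by
the time the timer fires: the bit stays set until the timer consumes it.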

But the real question actually is, do we really care?

I mean, this thing went unnoticed for so long and frankly, people should run
the CEC anyway which has a better MCE-has-been-logged stifling capability so
that I wanna say, let's do the simplest thing and be done with it.

Or?

Do we care about some real use case here...?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
