Message-ID: <20231114192324.GAZVPJLGZmfJBS181/@fat_crate.local>
Date: Tue, 14 Nov 2023 20:23:24 +0100
From: Borislav Petkov <bp@...en8.de>
To: Tony Luck <tony.luck@...el.com>
Cc: Yazen Ghannam <yazen.ghannam@....com>,
Smita.KoralahalliChannabasappa@....com,
dave.hansen@...ux.intel.com, x86@...nel.org,
linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
patches@...ts.linux.dev
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote:
> I want to track whether each bank is in storm mode, or not. But there's
> no indication when a CMCI is delivered which bank was the source. Code
> just has to scan all the banks, and might find more than one with an
> error. While no bank is in polling mode, there isn't a set time interval
> between scanning banks.
Well, if no bank is in *storm* polling mode - I presume that's what you
mean by "polling mode" - then we fall back to the default
machine_check_poll() interval of INITIAL_CHECK_INTERVAL, i.e., 5
minutes, I'd say.
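(For reference, roughly what that amounts to in mce/core.c - paraphrased
from memory, not quoted verbatim:)

	#define INITIAL_CHECK_INTERVAL	(5 * 60)	/* seconds, i.e. 5 minutes */

	static unsigned long check_interval = INITIAL_CHECK_INTERVAL;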
> A scan is just triggered when a CMCI happens. So it's non-trivial to
> compute a rate. Would require saving a timestamp for every logged
> error.
So what I'm trying to establish first is, what our entry vectors into
the storm code are.
1. You can enter storm tracking when you poll normally, i.e., when
machine_check_poll() runs once every 5 minutes.
mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
is what I was wondering. It is hidden a bit down in that function.
2. As a result of a CMCI interrupt. That will call machine_check_poll()
too and go down the same path.
So that flow should be called out in the commit message so that the big
picture is clear - how we're doing that storm tracking.
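Roughly, the picture I have in my head - a sketch only, the exact
function names and call chains may differ from what the patch ends up
doing:

	/* 1) periodic timer path, every check_interval (5 min by default) */
	static void mce_timer_fn(struct timer_list *t)
	{
		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
	}

	/* 2) CMCI interrupt path - scan the banks this CPU owns */
	static void intel_threshold_interrupt(void)
	{
		machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
	}

	/*
	 * Both end up in machine_check_poll(), which calls mce_track_storm()
	 * per bank and records a hit only for MCI_STATUS_VAL=1b CEs.
	 */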
Now, what is protecting this against concurrent runs of
machine_check_poll()? Imagine the timer fires and a CMCI happens at the
same time and on the same core.
> In a simple case there's just one bank responsible for a ton of CMCI.
> No need for complexity here, the count of interrupts from that bank will
> hit a threshold and a storm is declared.
Right.
> But more complex scenarios are possible. Other banks may trigger small
> numbers of CMCI. Not enough to call it a storm. Or multiple banks may
> be screaming together.
>
> By tracking both the hits and misses in each bank, I end up with a
> bitmap history for the past 64 polls. If there are enough "1" bits in
> that bitmap to meet the threshold, then declare a storm for that
> bank.
Yap, I only want to be crystal clear on the flow and the entry points.
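Just so we're talking about the same mechanism, here's a
back-of-the-envelope sketch of that history tracking - hypothetical
names and a fixed shift-by-one, i.e., simplified vs. the patch, which
shifts by the number of elapsed poll intervals:

	#define NUM_HISTORY_BITS	(sizeof(u64) * BITS_PER_BYTE)
	#define STORM_BEGIN_THRESHOLD	5	/* example value only */

	struct storm_bank {
		u64	history;	/* 1 bit per poll: 1 = bank saw a CE */
		bool	in_storm_mode;
	};

	static void track_bank(struct storm_bank *b, bool saw_error)
	{
		b->history = (b->history << 1) | saw_error;

		/* enough set bits in the window -> declare a storm */
		if (!b->in_storm_mode &&
		    hweight64(b->history) >= STORM_BEGIN_THRESHOLD)
			b->in_storm_mode = true;
	}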
> I need to stare at this again to refresh my memory of what's going on
> here. This code may need pulling apart into a routine that is used for
> systems with no CMCI (or with CMCI disabled). Then the whole "divide the
> poll interval by two" when you see an error and double the interval
> when you don't see an error makes sense.
>
> For systems with CMCI ... I think just polling a one second interval
> until the storm is over makes sense.
Ok.
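For the !CMCI systems, that adaptive scheme is basically what the timer
does today, if I remember correctly - roughly, not the exact code:

	/* halve the interval after an error, back off towards check_interval
	 * (5 min) when things stay quiet */
	static unsigned long adapt_interval(unsigned long iv, bool error_seen)
	{
		if (error_seen)
			return max(iv / 2, (unsigned long)HZ / 100);

		return min(iv * 2, check_interval * HZ);
	}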
> These are only used in threshold.c now. What's the point of them
> being in internal.h? That's for definitions shared by multiple
> mce/*.c files, isn't it? But will move there if you still want this.
Structs hidden in .c files look weird but ok.
> Ideally the new CPU would inherit the precise state of the previous
> owner of this bank. But there's no easy way to track that as the bank
> is abandoned by the CPU going offline, and there is a free-for-all with
> remaining CPUs racing to claim ownership. It is known that this bank
> was in storm mode (because the threshold in the CTL2 bank register is
> set to CMCI_STORM_THRESHOLD).
>
> I went with "worst case" to make sure the new CPU didn't prematurely
> declare an end to the storm.
>
> I'll add a comment in mce_inherit_storm() to explain this.
Yap, exactly.
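FWIW, here's roughly what I'd expect that worst-case inheritance to
boil down to - a sketch with approximated names, the comment you add
should spell out the real thing:

	static void mce_inherit_storm(unsigned int bank)
	{
		struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);

		/*
		 * The previous owner's history is gone; all we know is that
		 * the bank was in storm mode (CTL2 threshold ==
		 * CMCI_STORM_THRESHOLD), so assume every recent poll saw an
		 * error so the new owner doesn't end the storm prematurely.
		 */
		storm->banks[bank].history = ~0ull;
		storm->banks[bank].timestamp = jiffies;
	}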
> Like this?
>
> #define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
>
> if (shift < NUM_HISTORY_BITS)
Yap.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette