Message-ID: <20210106191708.GB2743@paulmck-ThinkPad-P72>
Date: Wed, 6 Jan 2021 11:17:08 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"bp@...en8.de" <bp@...en8.de>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>,
"kernel-team@...com" <kernel-team@...com>
Subject: Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs
On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate it with other hardware-oriented error messages. This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
>
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
> but some got hung up at various points beyond the entry point. My only theory
> was that they were trying to snoop caches from the dead core (or access some
> other resource held by the dead core) and so they hung too.
>
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.

Understood. The system is probably not in the best shape if this code
is ever executed, after all. ;-)

So how about like this?

	pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

Easy enough if so!

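
For anyone following along, here is a rough userspace sketch of the bookkeeping idea described in the commit log above: track which CPUs have checked in, and on timeout report the ones that have not. This is not the patch itself. The type cpu_bits_t and the helpers mark_entered(), holdouts() and format_cpu_list() are hypothetical names made up for illustration, the mask is limited to 64 CPUs, and the code is single-threaded, whereas a real handler would need atomic updates on a cpumask_t. The formatter only imitates the range-list style that the kernel's %*pbl specifier produces.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_bits_t;

/* Record that a CPU reached the handler (out-of-range CPUs are ignored). */
static inline void mark_entered(cpu_bits_t *entered, unsigned int cpu)
{
	if (cpu < 64)
		*entered |= (cpu_bits_t)1 << cpu;
}

/* CPUs that are online but never checked in. */
static inline cpu_bits_t holdouts(cpu_bits_t online, cpu_bits_t entered)
{
	return online & ~entered;
}

/*
 * Format a mask as a comma-separated range list such as "2-3,7".
 * An empty mask yields an empty string. Returns buf.
 */
static inline char *format_cpu_list(cpu_bits_t mask, char *buf, size_t len)
{
	size_t used = 0;
	unsigned int cpu = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	while (cpu < 64) {
		unsigned int start, end;
		int n;

		if (!((mask >> cpu) & 1)) {
			cpu++;
			continue;
		}

		/* Found the start of a run of set bits; walk to its end. */
		start = cpu;
		while (cpu < 64 && ((mask >> cpu) & 1))
			cpu++;
		end = cpu - 1;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%u",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%u-%u",
				     used ? "," : "", start, end);

		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small: output is truncated */
		used += (size_t)n;
	}
	return buf;
}
```

With eight CPUs online and CPUs 0, 1, 4, 5 and 6 marked as entered, holdouts() leaves bits 2, 3 and 7 set, and format_cpu_list() renders that as "2-3,7". As Tony points out, a list like this only says which CPUs failed to check in, not why, so some of the reported CPUs may be collateral damage rather than the original culprits.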
> If I get time in the next day or two, I'll run my old test against your code to
> see what happens.

Thank you very much in advance!

For my own testing, is this still the right thing to use?

https://github.com/andikleen/mce-inject

							Thanx, Paul