Message-ID: <1876afbe-a167-2be5-3690-846700eeb76c@nurealm.net>
Date: Tue, 18 May 2021 21:58:46 -0600
From: James Feeney <james@...ealm.net>
To: Borislav Petkov <bp@...e.de>
Cc: linux-smp@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
lkml <linux-kernel@...r.kernel.org>
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

On 5/17/21 2:32 AM, Borislav Petkov wrote:
> + lkml.
>
> On Mon, May 17, 2021 at 02:13:45AM -0600, James Feeney wrote:
>> I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser. My second bisect went from good to bad.
>>
>> So now, instead, git bisect ended here:
>>
>> 4f432e8bb15b352da72525144da025a46695968f is the first bad commit
>> commit 4f432e8bb15b352da72525144da025a46695968f
>> Author: Borislav Petkov <bp@...e.de>
>> Date: Thu Jan 7 13:23:34 2021 +0100
>>
>> x86/mce: Get rid of mcheck_intel_therm_init()
>>
>> Move the APIC_LVTTHMR read which needs to happen on the BSP, to
>> intel_init_thermal(). One less boot dependency.
>>
>> No functional changes.
>>
>> Signed-off-by: Borislav Petkov <bp@...e.de>
>> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
>> Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
>>
>> arch/x86/include/asm/mce.h | 6 ------
>> arch/x86/kernel/cpu/mce/core.c | 1 -
>> arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
>> 3 files changed, 4 insertions(+), 18 deletions(-)
>>
>>
>> Please let me know if that makes more sense.
>
> Not really - this is the first time I'm seeing this and I highly doubt
> your bisection is correct. But we'll see.
>
I did go back and repeat the git bisect a third time. This time, I rebooted each of the "good" kernels 10 times, in case a "good" kernel had simply "gotten lucky" and failed to produce an error on that boot. There were *no* boot failures on the "good" kernels, and there was *no change* in the resulting final "bad" commit.
>>
>> Again:
>>
>> Arch Linux
>> linux 5.12.arch1-1
>
> Can you reproduce with the upstream 5.12 kernel to rule out influence by
> any distro-specific patches?
>
Hmm - I am naively supposing that "the bisect is the bisect": no matter which commit introduces a problem, it's still a problem. It would be useful to investigate and examine the calling functions in the Call Trace, no?
>> Intel Core2 T7200
>> Mobile Intel 945PM Express Chipset
>> ICH7-M
>> Mobility Radeon X1600
>
> Can you send full dmesg from a working kernel and the .config you're
> using with 5.12?
>
Attached:
dmesglog.7bb39313cd62
bisectconfig
7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
4f432e8bb15b x86/mce: Get rid of mcheck_intel_therm_init()
7bb39313cd62 is the immediately previous "good" bisect kernel. The config files for the two kernels are exactly the same.
>> Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:
>>
>> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
>> ...
>> RIP: 0010:smp_call_function_single+0xf7/0x140
>>
>> The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.
>>
>
> I'm presuming there's no way to connect your box over serial cable to
> another one so that you can catch the full bad dmesg when it hangs? It
> would be good if you could...
>
Attached:
bootlog.7bb39313cd62
bootlog.4f432e8bb15b
The latter shows the "soft lockup" repeating four times. The kernel command line has loglevel=5 and console=ttyS0,115200.
> Thx.
>
Thanks for looking into this. Would some additional printk's be useful?
James
View attachment "dmesglog.7bb39313cd62" of type "text/plain" (72375 bytes)
View attachment "bisectconfig" of type "text/plain" (237378 bytes)
View attachment "bootlog.7bb39313cd62" of type "text/plain" (12778 bytes)
View attachment "bootlog.4f432e8bb15b" of type "text/plain" (20300 bytes)