Message-ID: <984ee4ab-6e6b-cb0e-a4f1-ce2951994b1d@nurealm.net>
Date: Wed, 19 May 2021 14:03:05 -0600
From: James Feeney <james@...ealm.net>
To: Borislav Petkov <bp@...e.de>
Cc: linux-smp@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
lkml <linux-kernel@...r.kernel.org>
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! -
RIP smp_call_function_single
On 5/19/21 5:12 AM, Borislav Petkov wrote:
> On Tue, May 18, 2021 at 09:58:46PM -0600, James Feeney wrote:
>> Hmm - I am naively supposing that "the bisect is the bisect". No
>> matter what commit initiates a problem, it's still a problem. It would
>> be useful to investigate, and introspect the calling functions in the
>> Call Trace. No?
>
> I'd like to know that the source you're looking at is the same source
> I'm looking at.
>
> And yes, AFAIK, Arch kernels are simply the upstream kernels but
> still...
>
I had to ask, and got this answer:
====
The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
====
There are no additional kernel patches in the build.
>> Attached:
>> bootlog.7bb39313cd62
>> bootlog.4f432e8bb15b
>>
>> The latter with the "soft lockup" repeating four times. The kernel
>> command line has loglevel=5 and console=ttyS0,115200.
>
> Those are not the full boot messages - they should look like
> dmesglog.7bb39313cd62 but probably you cannot log into the box after the
> softlockup happens to dump them. That's why I meant to try the serial
> connection...
>
> Anyway, let's start somewhere.
>
> 1. Take a pristine 5.12 upstream kernel from git, build it using your
> bisectconfig and try booting it with
>
> debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0
>
> on the kernel command line. Then save a full dmesg, if you can. If you
> can catch it over serial, then that would be awesomer.
>
> 2. Use the exact same kernel but this time disable
>
> CONFIG_X86_THERMAL_VECTOR
>
> in its .config and do the same thing.
>
> Send me both dmesg files then.
>
> Thx.
>
$ git bisect reset v5.12-arch1
Updating files: 100% (12812/12812), done.
Previous HEAD position was 7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
HEAD is now at bee4e691ceea Arch Linux kernel v5.12-arch1
$ grep CONFIG_X86_THERMAL_VECTOR .config
CONFIG_X86_THERMAL_VECTOR=y
Attached:
dmesglog.5.12.therm.1.nostart
hangs after unpack rootfs
dmesglog.5.12.therm.2.softlockup
soft lockup, but stops and does not repeat
dmesglog.5.12.therm.3.fullboot
boots all the way to Xorg and does run a browser and play video
The fourth boot attempt hung again at unpack rootfs. If the machine is left sitting in this state, the fan starts running at full speed, off and on, which suggests that the processor may still be running, and running at full power.
These boots are consecutive and are all from the same stock 5.12.0 kernel.
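For comparing the attached logs at a glance, something like the following might help. It is only a rough sketch: the function names are made up for illustration, and it assumes the watchdog reports contain the literal text "soft lockup", as in the subject line.

```shell
#!/bin/sh
# count_lockups FILE: print how many lines of FILE mention a soft lockup.
# (Hypothetical helper name; assumes the message contains "soft lockup".)
count_lockups() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c 'soft lockup' "$1" || true
}

# summarize_logs FILE...: print one summary line per log file.
summarize_logs() {
    for log in "$@"; do
        printf '%s: %s soft lockup line(s)\n' "$log" "$(count_lockups "$log")"
    done
}
```

Run as `summarize_logs dmesglog.5.12.therm.*` in the directory holding the saved logs. A count of zero does not distinguish a hang from a full boot, so the logs still need to be read.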
> Use the exact same kernel but this time disable CONFIG_X86_THERMAL_VECTOR
$ make menuconfig
...
This config option is not listed and is not changeable:
====
drivers/thermal/intel/Kconfig
config X86_THERMAL_VECTOR
def_bool y
depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
====
The Makefile there has:
obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o
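As far as I understand Kconfig, a symbol only shows up in menuconfig when its entry carries a prompt string, and a bare "def_bool y" has none. Here is a rough sketch of a check for that. The function name is hypothetical, and the parsing is a simple heuristic that does not understand help text or every Kconfig construct.

```shell
#!/bin/sh
# has_prompt SYMBOL FILE: print "yes" if the Kconfig entry for SYMBOL appears
# to define a prompt string, "no" otherwise. Heuristic only, not a real parser.
has_prompt() {
    awk -v sym="$1" '
        # A config or menuconfig line starts an entry; note whether it is ours.
        $1 == "config" || $1 == "menuconfig" { in_entry = ($2 == sym); next }
        # Other block keywords end whatever entry we were in.
        $1 == "choice" || $1 == "menu" || $1 == "comment" { in_entry = 0 }
        # A prompt is an explicit prompt line, or a type followed by a string.
        in_entry && $1 == "prompt" { found = 1 }
        in_entry && $1 ~ /^(bool|tristate|string|int|hex)$/ && $2 ~ /^"/ { found = 1 }
        END { print (found ? "yes" : "no") }
    ' "$2"
}
```

For the entry quoted above, `has_prompt X86_THERMAL_VECTOR drivers/thermal/intel/Kconfig` should print "no", which would match the option being absent from the menu. Editing .config by hand would presumably not stick either, since the default is recomputed from the dependencies.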
The files, thermal_interrupt.h and therm_throt.c, by Dmitriy Zavin, are new since 5.11. But, it seems that this therm_throt.c file is one of yours, anyway:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/thermal/intel/therm_throt.c?h=linux-5.12.y&id=9223d0dccb8f8523754122f68316dd1a4f39f7f8
I'm not sure that I can just delete these files, being thermal management and all.
I see some talk in the associated thread about IRQ handler registration. Could there be some connection between this and the soft lockup?
https://lore.kernel.org/linux-pm/20210201142704.12495-1-bp@alien8.de/
What should we do next?
James
View attachment "dmesglog.5.12.therm.1.nostart" of type "text/plain" (38255 bytes)
View attachment "dmesglog.5.12.therm.2.softlockup" of type "text/plain" (71124 bytes)
View attachment "dmesglog.5.12.therm.3.fullboot" of type "text/plain" (69356 bytes)