Message-ID: <e7701de5-35f3-da9d-7339-df2de6d8b3cf@nurealm.net>
Date: Wed, 19 May 2021 21:12:04 -0600
From: James Feeney <james@...ealm.net>
To: Borislav Petkov <bp@...e.de>
Cc: linux-smp@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
lkml <linux-kernel@...r.kernel.org>
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! -
RIP smp_call_function_single
On 5/19/21 3:18 PM, Borislav Petkov wrote:
> On Wed, May 19, 2021 at 02:03:05PM -0600, James Feeney wrote:
>> I had to ask, and got this answer:
>>
>> ====
>> The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
>> ====
>>
>> There are no additional kernel patches in the build.
>
> Yeah, ok, let's say you're running pretty much an upstream kernel.
>
>> These boots are consecutive and are all from the same stock 5.12.0 kernel.
>
> Yeah, that's weird. Box seems to boot fine in some cases. There's
>
> [ 26.864040] ACPI: OSL: Resource conflict; ACPI support missing from driver?
> [ 26.874541] ACPI: OSL: Resource conflict: System may be unstable or behave erratically
>
> Dunno how relevant this is - it probably has been the case since forever.
>
That's, in part, from the kernel command line parameter "acpi_enforce_resources=lax". Without it, the RAM SPD EEPROMs cannot be accessed. Also, the eeprom module is deprecated and has been replaced by the at24 module.
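For reference, a minimal sketch of checking whether a given parameter is present on a kernel command line. The helper name parse_cmdline and the example string are hypothetical; on a running system the text would come from /proc/cmdline. It splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Map each kernel parameter to its value; bare flags map to None.

    Assumption: parameters are separated by plain whitespace and no value
    contains quoted spaces.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Hypothetical example string, not taken from the attached logs.
example = "root=/dev/sda1 rw acpi_enforce_resources=lax quiet"
print(parse_cmdline(example).get("acpi_enforce_resources"))
```

A missing parameter simply yields None from .get(), which is how the "notlax" boots would show up.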
>> $ make menuconfig
>> ...
>>
>> This config option is not listed and is not changeable:
>
> Bah, sorry about that. Use this small hunk ontop so that you can disable
> it in menuconfig:
>
> ---
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index ce4f59213c7a..8c1376a20b54 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -9,7 +9,8 @@ config INTEL_POWERCLAMP
>  	  user interface is exposed via generic thermal framework.
>  
>  config X86_THERMAL_VECTOR
> -	def_bool y
> +	tristate "X86 thermal vector"
> +	default y
>  	depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
>  
>  config X86_PKG_TEMP_THERMAL
> ---
>
> and with it ontop, boot it the aforementioned way.
>
Ok - I didn't want to do that without some idea that it would not disable the whole thermal management process.
$ make menuconfig
...
$ diff .config .config.old
4983c4983,4984
< # CONFIG_X86_THERMAL_VECTOR is not set
---
> CONFIG_X86_THERMAL_VECTOR=y
> CONFIG_X86_PKG_TEMP_THERMAL=m
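The diff above relies on the usual .config convention: an enabled symbol appears as CONFIG_NAME=value, and a disabled one as a "# CONFIG_NAME is not set" comment. A small sketch for confirming the state of one symbol in a .config text follows; the function name config_state is hypothetical.

```python
import re


def config_state(config_text: str, symbol: str):
    """Return the value assigned to CONFIG_<symbol> ('y', 'm', ...).

    Returns None when the symbol is absent or only appears in a
    '# CONFIG_<symbol> is not set' comment.
    """
    pattern = re.compile(
        rf"^CONFIG_{re.escape(symbol)}=(.*)$", re.MULTILINE
    )
    match = pattern.search(config_text)
    return match.group(1) if match else None


# The two variants shown in the diff, as in-memory strings.
new_config = "# CONFIG_X86_THERMAL_VECTOR is not set\n"
old_config = "CONFIG_X86_THERMAL_VECTOR=y\nCONFIG_X86_PKG_TEMP_THERMAL=m\n"

print(config_state(new_config, "X86_THERMAL_VECTOR"))
print(config_state(old_config, "X86_PKG_TEMP_THERMAL"))
```

In practice the text would be read from the .config file in the build tree.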
No joy. I still get the same mix of soft lockups and full boots, with the full boots interrupted by some mystery delay.
Attached:
dmesglog.5.12.NoTherm.1.softlockup
Out of curiosity, I also tried removing the "acpi_enforce_resources=lax" configuration, but that makes no difference, other than removing the "System may be unstable or behave erratically" warnings.
Attached:
dmesglog.5.12.NoTherm.2.notlax.fullboot
dmesglog.5.12.NoTherm.3.notlax.softlockup
I'm not familiar with the patches that modify and move therm_throt.c, so I'm not in a position to draw any conclusion from these results.
> And then pls build 5.11 which doesn't have the therm_throt patch - I'd
> like to see if that therm_throt thing is even loading. Because I don't
> see something like:
>
> [ 0.302411] mce: CPU0: Thermal monitoring enabled (TM2)
>
> in your dmesg. Could be some detection failure due to the change when
> those happen. Although they shouldn't but who knows.
>
Build 5.11? I have run lots of 5.11 kernels from the Arch distribution. Are you looking for a dmesg log from 5.11?
I'm confused. There is a similar arch/x86/kernel/cpu/mce/therm_throt.c in the 5.11 kernel. What is going to be different? Even back to the 5.10 kernels, all the references to the thermal governors and the thermal zones in the dmesg log are the same.
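To compare logs across kernel versions for the "Thermal monitoring enabled" message quoted above, a sketch like the following could scan a saved dmesg text. The function name thermal_monitoring_cpus is hypothetical, and the pattern is taken only from the single quoted line, so the log prefix or wording may differ on other kernels.

```python
import re

# Pattern modelled on the quoted line:
#   [ 0.302411] mce: CPU0: Thermal monitoring enabled (TM2)
# The "mce: " prefix is deliberately not required.
THERMAL_RE = re.compile(r"CPU(\d+): Thermal monitoring enabled \((\w+)\)")


def thermal_monitoring_cpus(dmesg_text: str):
    """Return a list of (cpu_number, mode) pairs found in the log text."""
    return [(int(cpu), mode) for cpu, mode in THERMAL_RE.findall(dmesg_text)]


sample = "[ 0.302411] mce: CPU0: Thermal monitoring enabled (TM2)\n"
print(thermal_monitoring_cpus(sample))
```

An empty list for a given log would mean the message, in this form, never appeared during that boot.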
> I have a similar box as yours:
>
> [ 0.305858] smpboot: CPU0: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz (family: 0x6, model: 0xf, stepping: 0xd)
>
> a bit newer stepping and AFAIR, it boots fine but I'll try your .config
> there to check just in case.
>
> Thx.
>
>
So far, something looks quirky - somewhere. Timing-related failures can be a pain. Is there no useful information in the Call Trace in the dmesg log?
James
View attachment "dmesglog.5.12.NoTherm.1.softlockup" of type "text/plain" (81217 bytes)
View attachment "dmesglog.5.12.NoTherm.2.notlax.fullboot" of type "text/plain" (69225 bytes)
View attachment "dmesglog.5.12.NoTherm.3.notlax.softlockup" of type "text/plain" (96210 bytes)