Message-ID: <e7701de5-35f3-da9d-7339-df2de6d8b3cf@nurealm.net>
Date:   Wed, 19 May 2021 21:12:04 -0600
From:   James Feeney <james@...ealm.net>
To:     Borislav Petkov <bp@...e.de>
Cc:     linux-smp@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
        lkml <linux-kernel@...r.kernel.org>
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! -
 RIP smp_call_function_single

On 5/19/21 3:18 PM, Borislav Petkov wrote:
> On Wed, May 19, 2021 at 02:03:05PM -0600, James Feeney wrote:
>> I had to ask, and got this answer:
>>
>> ====
>> The sources contain commits on top of upstream releases. This is why the tags contain -arch1 etc. For example, see https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds 6 commits on top of the upstream "Linux 5.11.16" release, while https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains the long-standing "unprivileged_userns_clone" patch and the version number change, making it essentially vanilla.
>> ====
>>
>> There are no additional kernel patches in the build.
> 
> Yeah, ok, let's say you're running pretty much an upstream kernel.
> 
>> These boots are consecutive and are all from the same stock 5.12.0 kernel.
> 
> Yeah, that's weird. Box seems to boot fine in some cases. There's
> 
> [   26.864040] ACPI: OSL: Resource conflict; ACPI support missing from driver?
> [   26.874541] ACPI: OSL: Resource conflict: System may be unstable or behave erratically
> 
> Dunno how relevant this is - it probably has been the case since forever. 
> 

That's, in part, from the kernel command line "acpi_enforce_resources=lax".  Without that, the RAM SPD EEPROMs cannot be accessed.  And, the eeprom module is deprecated, being replaced by the at24 module.

>> $ make menuconfig
>> ...
>>
>> This config option is not listed and is not changeable:
> 
> Bah, sorry about that. Use this small hunk on top so that you can disable
> it in menuconfig:
> 
> ---
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index ce4f59213c7a..8c1376a20b54 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -9,7 +9,8 @@ config INTEL_POWERCLAMP
>  	  user interface is exposed via generic thermal framework.
>  
>  config X86_THERMAL_VECTOR
> -	def_bool y
> +	tristate "X86 thermal vector"
> +	default y
>  	depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
>  
>  config X86_PKG_TEMP_THERMAL
> ---
> 
> and with it on top, boot it the aforementioned way.
> 

Ok - I didn't want to do that without some idea that it would not disable the whole thermal management process.

$ make menuconfig
...

$ diff .config .config.old
4983c4983,4984
< # CONFIG_X86_THERMAL_VECTOR is not set
---
> CONFIG_X86_THERMAL_VECTOR=y
> CONFIG_X86_PKG_TEMP_THERMAL=m
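
As an aside, the state of a single option can also be read directly instead of diffing the two files.  This is only a sketch: config_state is a made-up helper name, not something from the kernel tree, and it assumes the usual .config line formats ("CONFIG_FOO=y" and "# CONFIG_FOO is not set").

```shell
# Hypothetical helper (not part of the kernel tree): print the value of one
# option from a kernel .config file, "n" if it is explicitly not set, or
# "undef" if the option does not appear at all.
# Usage: config_state OPTION_WITHOUT_PREFIX [config-file]
config_state() {
    opt="CONFIG_$1"
    file="${2:-.config}"
    if line=$(grep -E "^${opt}=" "$file"); then
        # Strip everything up to and including the first "=".
        printf '%s\n' "${line#*=}"
    elif grep -q "^# ${opt} is not set\$" "$file"; then
        printf 'n\n'
    else
        printf 'undef\n'
    fi
}
```

Called as "config_state X86_THERMAL_VECTOR .config", it reports on the option that the menuconfig change above was meant to turn off.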

No joy.  I still get the same mix of soft lockups and full boots, with the full boots interrupted by some unexplained delay.

Attached:
 dmesglog.5.12.NoTherm.1.softlockup

Out of curiosity, I also tried removing "acpi_enforce_resources=lax" from the kernel command line, but that made no difference, other than removing the "System may be unstable or behave erratically" warnings.

Attached:
 dmesglog.5.12.NoTherm.2.notlax.fullboot
 dmesglog.5.12.NoTherm.3.notlax.softlockup


I'm not familiar with the patches that modify and move therm_throt.c, so I'm not in a position to draw any conclusion from these results.


> And then pls build 5.11 which doesn't have the therm_throt patch - I'd
> like to see if that therm_throt thing is even loading. Because I don't
> see something like:
> 
> [    0.302411] mce: CPU0: Thermal monitoring enabled (TM2)
> 
> in your dmesg. Could be some detection failure due to the change when
> those happen. Although they shouldn't but who knows.
> 

Build 5.11?  I have run lots of 5.11 kernels from the Arch distribution.  Are you looking for a dmesg log from 5.11?

I'm confused.  The 5.11 kernel has a similar arch/x86/kernel/cpu/mce/therm_throt.c.  What is going to be different?  Even back to the 5.10 kernels, in the dmesg log, all the references to the thermal governors and the thermal zones are the same.
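
For comparing kernels, the saved logs can be checked for the "Thermal monitoring enabled" line quoted above.  A rough sketch follows; the function names are made up for illustration, and the pattern assumes the message keeps the "mce: CPU<n>: Thermal monitoring enabled" form shown in the quoted example.

```shell
# Hypothetical helper: succeed if a saved dmesg log contains the
# "Thermal monitoring enabled" line for any CPU.
has_thermal_monitoring() {
    grep -Eq 'mce: CPU[0-9]+: Thermal monitoring enabled' "$1"
}

# Print one "present" or "missing" line per log file given as an argument.
report_thermal_monitoring() {
    for log in "$@"; do
        if has_thermal_monitoring "$log"; then
            printf '%s: present\n' "$log"
        else
            printf '%s: missing\n' "$log"
        fi
    done
}
```

Running "report_thermal_monitoring dmesglog.*" over logs from 5.10, 5.11 and 5.12 would show whether the line disappears with a particular kernel version.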


> I have a similar box as yours:
> 
> [    0.305858] smpboot: CPU0: Intel(R) Pentium(R) Dual  CPU  E2180  @ 2.00GHz (family: 0x6, model: 0xf, stepping: 0xd)
> 
> a bit newer stepping and AFAIR, it boots fine but I'll try your .config
> there to check just in case.
> 
> Thx.
> 
> 

So far, something looks quirky - somewhere.  Timing-related failures can be a pain.  Is there no useful information being provided by the Call Trace in the dmesg log?


James

View attachment "dmesglog.5.12.NoTherm.1.softlockup" of type "text/plain" (81217 bytes)

View attachment "dmesglog.5.12.NoTherm.2.notlax.fullboot" of type "text/plain" (69225 bytes)

View attachment "dmesglog.5.12.NoTherm.3.notlax.softlockup" of type "text/plain" (96210 bytes)
