linux-kernel - Re: kdump kernel randomly hang with tick_periodic call trace on bare metal system

Message-ID: <Y6PYYUKXR2OCH3WG@MiWiFi-R3L-srv>
Date:   Thu, 22 Dec 2022 12:09:05 +0800
From:   Baoquan He <bhe@...hat.com>
To:     "Guilherme G. Piccoli" <gpiccoli@...lia.com>
Cc:     x86@...nel.org, kexec@...ts.infradead.org,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: kdump kernel randomly hang with tick_periodic call trace on bare
 metal system

On 12/21/22 at 12:46pm, Guilherme G. Piccoli wrote:
> On 20/12/2022 02:51, Baoquan He wrote:
> > On 12/20/22 at 01:41pm, Baoquan He wrote:
> >> On one intel bare metal system, I can randomly reproduce the kdump hang
> >> as below with tick_periodic call trace. Attach the kernel config for
> >> reference.
> > 
> > Forgot mentioning this random hang is also caused by adding
> > 'nr_cpus=2' into normal kernel's cmdline, then triggering crash will get
> > kdump kernel hang as below kdump log shown.
> > 
> 
> The weird thing is that you seem to be using "nr_cpus=1" instead - this
> is the cmdline from the log:
> 
> "nr_cpus=2 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off
> numa=off udev.children-max=2 panic=10 acpi_no_memhotplug
> transparent_hugepage=never nokaslr hest_disable novmcoredd cma=0
> hugetlb_cma=0 disable_cpu_apicid=16 [...]"
> 
> You seems to pass twice the "nr_cpus" thing, and I guess kernel pick the
> last one?

From the kdump kernel boot log, yes, nr_cpus=1 is the one taken.
parse_early_param() parses the kernel parameters one by one, so the
last one takes effect. Here, the problem is not whether nr_cpus is 2 or
1. The bare metal system has 16 CPUs but only 2 of them are present, and
it seems the 14 halted CPUs get a wrong message and behave incorrectly,
which causes the issue.
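
To illustrate the "last one takes effect" behaviour, here is a toy model
in Python. It is not the kernel's parser (which also handles quoting and
calls a handler per parameter); last_value() is a hypothetical helper
that only shows why a later nr_cpus= overrides an earlier one when
tokens are processed in order.

```python
def last_value(cmdline, name):
    """Return the value of the last `name=value` token, or None if absent.

    Toy model only: tokens are split on whitespace and processed left to
    right, and each match overwrites the previous one.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # a later occurrence overwrites an earlier one
    return value


# Shortened version of the cmdline quoted from the kdump log.
cmdline = "nr_cpus=2 irqpoll nr_cpus=1 reset_devices"
print(last_value(cmdline, "nr_cpus"))
```

With the shortened cmdline above, the helper returns "1", matching what
the kdump kernel boot log shows.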

> 
> Also, what is "disable_cpu_apicid=16"? Could this be related?

Not really. Please check disable_cpu_apicid in
Documentation/admin-guide/kdump/kdump.rst; it is the BSP's APIC ID.