[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <fd3f73dc-a86f-4bcf-9c60-43556a21eb42@googlemail.com>
Date: Mon, 27 May 2024 09:29:38 +0200
From: Peter Schneider <pschneider1968@...glemail.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
stable@...r.kernel.org, regressions@...ts.linux.dev
Subject: Kernel 6.9 regression: X86: Bogus messages from topology detection
Hi all,
I have noticed strange messages in kernel version 6.9, obviously from CPU topology
detection, which were not present in 6.8.y and earlier kernels.
This is coming from an older server machine: 2-socket Ivy Bridge Xeon E5-2697 v2 (24C/48T)
in an Asus Z9PE-D16/2L motherboard (Intel C-602A chipset); BIOS patched to the latest
available from Asus. All memory slots occupied, so 256 GB RAM in total.
From a "good boot", e.g. kernel 6.8.11, dmesg output looks like this:
[ 1.823797] smpboot: x86: Booting SMP configuration:
[ 1.823799] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11
[ 1.827514] .... node #1, CPUs: #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[ 0.011462] smpboot: CPU 12 Converting physical 0 to logical die 1
[ 1.875532] .... node #0, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.882453] .... node #1, CPUs: #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 1.887532] MDS CPU bug present and SMT on, data leak possible. See
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.933640] smp: Brought up 2 nodes, 48 CPUs
[ 1.933640] smpboot: Max logical packages: 2
[ 1.933640] smpboot: Total of 48 processors activated (259199.61 BogoMIPS)
From a "bad" boot, e.g. kernel 6.9.2, dmesg output has these messages in it:
[ 1.785937] smpboot: x86: Booting SMP configuration:
[ 1.785939] .... node #0, CPUs: #4
[ 1.786215] .... node #1, CPUs: #12 #16
[ 1.793547] MDS CPU bug present and SMT on, data leak possible. See
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 1.797547] .... node #0, CPUs: #1 #2 #3 #5 #6 #7 #8 #9 #10 #11
[ 1.801858] .... node #1, CPUs: #13 #14 #15 #17 #18 #19 #20 #21 #22 #23
[ 1.804687] .... node #0, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.810728] .... node #1, CPUs: #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 1.901547] smp: Brought up 2 nodes, 48 CPUs
[ 1.901547] smpboot: Total of 48 processors activated (259207.87 BogoMIPS)
[ 1.903803] BUG: arch topology borken
[ 1.903879] the SMT domain not a subset of the CLS domain
[ 1.903970] BUG: arch topology borken
[ 1.904040] the SMT domain not a subset of the CLS domain
[ 1.904128] BUG: arch topology borken
[ 1.904198] the SMT domain not a subset of the CLS domain
... and this "BUG" and the following line repeat 48 times which is the number of logical
CPUs this machine has. Also, there is a funny typo in the message, but that might be
intended, I guess?! Moreover I noticed, from node #1, CPU #12 detection message is
missing, so the counting maybe wrong?!
However the machine boots, and except from these strange messages, I cannot detect any
other abnormal behaviour. It is running ~15 QEMU/KVM virtual machines just fine. Because
these messages look unusual and a bit scary though, I have bisected the issue, to be able
to report it here. The first bad commit I found is this one:
22d63660c35eb751c63a709bf901a64c1726592a is the first bad commit
commit 22d63660c35eb751c63a709bf901a64c1726592a
Author: Thomas Gleixner <tglx@...utronix.de>
Date: Tue Feb 13 22:04:08 2024 +0100
x86/cpu: Use common topology code for Intel
Intel CPUs use either topology leaf 0xb/0x1f evaluation or the legacy
SMP/HT evaluation based on CPUID leaf 0x1/0x4.
Move it over to the consolidated topology code and remove the random
topology hacks which are sprinkled into the Intel and the common code.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
Tested-by: Juergen Gross <jgross@...e.com>
Tested-by: Sohil Mehta <sohil.mehta@...el.com>
Tested-by: Michael Kelley <mhklinux@...look.com>
Tested-by: Zhang Rui <rui.zhang@...el.com>
Tested-by: Wang Wendy <wendy.wang@...el.com>
Tested-by: K Prateek Nayak <kprateek.nayak@....com>
Link: https://lore.kernel.org/r/20240212153624.893644349@linutronix.de
arch/x86/kernel/cpu/common.c | 65 -----------------------------------
arch/x86/kernel/cpu/cpu.h | 4 ---
arch/x86/kernel/cpu/intel.c | 25 --------------
arch/x86/kernel/cpu/topology.c | 22 ------------
arch/x86/kernel/cpu/topology_common.c | 5 ++-
5 files changed, 4 insertions(+), 117 deletions(-)
root@...us:/usr/src/linux#
I attach my bisect log, and full dmesg output from a good and from a bad kernel version.
Moreover, the last 3 bad kernels from my bisect session did not boot at all, including the
one with commit SHA1 from the first bad commit above. These kernels also had the series of
"BUG" messages scrolling through on the console, and then additionally a kernel panic,
seemingly coming from a divide exception from function init_intel_microcode:
<5>[ 5.968685] Key type dns_resolver registered
<4>[ 5.974402] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
<4>[ 5.977017] divide error: 0000 [#1] PREEMPT SMP PTI
<4>[ 5.977116] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc4+ #1
<4>[ 5.977213] Hardware name: ASUSTeK COMPUTER INC. Z9PE-D16 Series/Z9PE-D16 Series,
BIOS 5601 06/11/2015
<4>[ 5.977337] RIP: 0010:init_intel_microcode+0x3c/0x80
<4>[ 5.977436] Code: ff 75 44 40 80 fe 05 76 3e 48 8b 05 b6 45 f7 ff a9 00 00 00 40 75
30 8b 05 85 46 f7 ff 0f b7 0d aa 46 f7 ff 31 d2 48 c1 e0 0a <48> f7 f1 89 05 9b f9 46 ff
48 c7 c0 c0 98 e4 a8 31 d2 31 c9 31 f6
<4>[ 5.977602] RSP: 0000:ffffb79b8008fd80 EFLAGS: 00010206
<4>[ 5.977697] RAX: 0000000001e00000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[ 5.977795] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
<4>[ 5.977894] RBP: ffffb79b8008fdf8 R08: 0000000000000000 R09: 0000000000000000
<4>[ 5.977992] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[ 5.978090] R13: 000000000000019a R14: ffffb79b8008fe08 R15: ffff96ad4026cf00
<4>[ 5.978187] FS: 0000000000000000(0000) GS:ffff96cc3fa40000(0000) knlGS:0000000000000000
<4>[ 5.978308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 5.978402] CR2: 0000000000000000 CR3: 0000000e6d236001 CR4: 00000000001706f0
<4>[ 5.978500] Call Trace:
<4>[ 5.978588] <TASK>
<4>[ 5.978675] ? show_regs+0x6d/0x80
<4>[ 5.978767] ? die+0x37/0xa0
<4>[ 5.978857] ? do_trap+0xd4/0xf0
<4>[ 5.978948] ? do_error_trap+0x71/0xb0
<4>[ 5.979040] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979131] ? exc_divide_error+0x3a/0x70
<4>[ 5.979226] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979317] ? asm_exc_divide_error+0x1b/0x20
<4>[ 5.979427] ? init_intel_microcode+0x3c/0x80
<4>[ 5.979520] ? microcode_init+0x196/0x260
<4>[ 5.979612] ? __pfx_microcode_init+0x10/0x10
<4>[ 5.979718] do_one_initcall+0x5e/0x340
<4>[ 5.979813] kernel_init_freeable+0x322/0x490
<4>[ 5.979906] ? __pfx_kernel_init+0x10/0x10
<4>[ 5.979998] kernel_init+0x1b/0x200
<4>[ 5.980089] ret_from_fork+0x47/0x70
<4>[ 5.980180] ? __pfx_kernel_init+0x10/0x10
<4>[ 5.980272] ret_from_fork_asm+0x1b/0x30
<4>[ 5.980364] </TASK>
<4>[ 5.980450] Modules linked in:
<4>[ 5.980544] ---[ end trace 0000000000000000 ]---
<4>[ 6.959943] RIP: 0010:init_intel_microcode+0x3c/0x80
<4>[ 6.960041] Code: ff 75 44 40 80 fe 05 76 3e 48 8b 05 b6 45 f7 ff a9 00 00 00 40 75
30 8b 05 85 46 f7 ff 0f b7 0d aa 46 f7 ff 31 d2 48 c1 e0 0a <48> f7 f1 89 05 9b f9 46 ff
48 c7 c0 c0 98 e4 a8 31 d2 31 c9 31 f6
<4>[ 6.960207] RSP: 0000:ffffb79b8008fd80 EFLAGS: 00010206
<4>[ 6.960316] RAX: 0000000001e00000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[ 6.960414] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
<4>[ 6.960512] RBP: ffffb79b8008fdf8 R08: 0000000000000000 R09: 0000000000000000
<4>[ 6.960610] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[ 6.960708] R13: 000000000000019a R14: ffffb79b8008fe08 R15: ffff96ad4026cf00
<4>[ 6.960806] FS: 0000000000000000(0000) GS:ffff96cc3fa40000(0000) knlGS:0000000000000000
<4>[ 6.960927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 6.961021] CR2: 0000000000000000 CR3: 0000000e6d236001 CR4: 00000000001706f0
<0>[ 6.961120] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
<0>[ 6.961312] Kernel Offset: 0x25c00000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
I also attached full dmesg log file "dmesg-erst-7373208397568540677" of this panic which I
could find in /var/lib/systemd/pstore.
Beste Grüße,
Peter Schneider
--
Climb the mountain not to plant your flag, but to embrace the challenge,
enjoy the air and behold the view. Climb it so you can see the world,
not so the world can see you. -- David McCullough Jr.
OpenPGP: 0xA3828BD796CCE11A8CADE8866E3A92C92C3FF244
Download: https://www.peters-netzplatz.de/download/pschneider1968_pub.asc
https://keys.mailvelope.com/pks/lookup?op=get&search=pschneider1968@googlemail.com
https://keys.mailvelope.com/pks/lookup?op=get&search=pschneider1968@gmail.com
View attachment "git_bisect.log" of type "text/plain" (3117 bytes)
View attachment "dmesg_v6.9.2_Bad.txt" of type "text/plain" (126368 bytes)
View attachment "dmesg_v6.8.11_Good.txt" of type "text/plain" (120622 bytes)
View attachment "dmesg-erst-7373208397568540677" of type "text/plain" (13259 bytes)
Download attachment "OpenPGP_signature.asc" of type "application/pgp-signature" (237 bytes)
Powered by blists - more mailing lists