Message-ID: <CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com>
Date: Tue, 22 Mar 2022 00:29:58 +0100
From: Jirka Hladky <jhladky@...hat.com>
To: linux-kernel <linux-kernel@...r.kernel.org>
Cc: Philip Auld <pauld@...hat.com>, Donald Zickus <dzickus@...hat.com>
Subject: PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" starting from 5.17 on dual
socket Intel Xeon Gold servers
Starting from kernel 5.17 (tested with rc2, rc4, rc7, rc8), we
experience a kernel oops on dual-socket Intel Xeon Gold servers
(2x Xeon Gold 6126 CPU).

Below are a backtrace and the dmesg log.

I have trouble creating a simple reproducer. The oops happens at random
places while preparing the NAS benchmark to be run. The script creates
a bunch of directories, compiles the benchmark, and starts trial runs.

Could you please help to narrow down the problem?

The reports below were created with kernel 5.17-rc8 and with the
echo 1 > /proc/sys/kernel/panic_on_oops
setting.
crash> sys
KERNEL: /usr/lib/debug/lib/modules/5.17.0-0.rc8.123.fc37.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Thu Mar 17 02:49:40 CET 2022
UPTIME: 00:02:50
LOAD AVERAGE: 0.32, 0.10, 0.03
TASKS: 608
NODENAME: gold-2s-c
RELEASE: 5.17.0-0.rc8.123.fc37.x86_64
VERSION: #1 SMP PREEMPT Mon Mar 14 18:11:49 UTC 2022
MACHINE: x86_64 (2600 Mhz)
MEMORY: 94.7 GB
PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
crash> bt
PID: 2480 TASK: ffff9e8f76cb8000 CPU: 26 COMMAND: "umount"
#0 [ffffae00cacbfbb8] machine_kexec at ffffffffbb068980
#1 [ffffae00cacbfc08] __crash_kexec at ffffffffbb1a300a
#2 [ffffae00cacbfcc8] crash_kexec at ffffffffbb1a4045
#3 [ffffae00cacbfcd0] oops_end at ffffffffbb02c410
#4 [ffffae00cacbfcf0] page_fault_oops at ffffffffbb076a38
#5 [ffffae00cacbfd68] exc_page_fault at ffffffffbbd0b7c1
#6 [ffffae00cacbfd90] asm_exc_page_fault at ffffffffbbe00ace
[exception RIP: kernfs_remove+7]
RIP: ffffffffbb421f67 RSP: ffffae00cacbfe48 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffffffffbce31e58 RCX: 0000000080200018
RDX: 0000000080200019 RSI: ffffdfbd44161640 RDI: 0000000000000000
RBP: ffffffffbce31e58 R8: 0000000000000000 R9: 0000000080200018
R10: ffff9e8f05859e80 R11: ffff9e9443b1bd98 R12: ffff9ea057f1d000
R13: ffffffffbce31e60 R14: dead000000000122 R15: dead000000000100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffae00cacbfe58] rdt_kill_sb at ffffffffbb05074b
#8 [ffffae00cacbfea8] deactivate_locked_super at ffffffffbb36ce1f
#9 [ffffae00cacbfec0] cleanup_mnt at ffffffffbb39176e
#10 [ffffae00cacbfee8] task_work_run at ffffffffbb10703c
#11 [ffffae00cacbff08] exit_to_user_mode_prepare at ffffffffbb17a399
#12 [ffffae00cacbff28] syscall_exit_to_user_mode at ffffffffbbd0bde8
#13 [ffffae00cacbff38] do_syscall_64 at ffffffffbbd071a6
#14 [ffffae00cacbff50] entry_SYSCALL_64_after_hwframe at ffffffffbbe0007c
RIP: 00007f442c75126b RSP: 00007ffc82d66fe8 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 000055bd4cc37090 RCX: 00007f442c75126b
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000055bd4cc3b950
RBP: 000055bd4cc371a8 R8: 0000000000000000 R9: 0000000000000073
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
R13: 000055bd4cc3b950 R14: 000055bd4cc372c0 R15: 000055bd4cc37090
ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
[2] dmesg
[ 172.776553] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 172.783513] #PF: supervisor read access in kernel mode
[ 172.788652] #PF: error_code(0x0000) - not-present page
[ 172.793793] PGD 0 P4D 0
[ 172.796330] Oops: 0000 [#1] PREEMPT SMP PTI
[ 172.800519] CPU: 26 PID: 2480 Comm: umount Kdump: loaded Not tainted 5.17.0-0.rc8.123.fc37.x86_64 #1
[ 172.809645] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 2.0b 03/07/2018
[ 172.817123] RIP: 0010:kernfs_remove+0x7/0x50
[ 172.821397] Code: e8 be e7 2c 00 48 89 df e8 b6 8c f0 ff 48 c7 c3
f4 ff ff ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e c3 cc 66 90 0f 1f 44 00
00 55 53 <48> 8b 47 08 48 89 fb 48 85 c0 48 0f 44 c7 48 8b 68 50 48 83
c5 60
[ 172.840141] RSP: 0018:ffffae00cacbfe48 EFLAGS: 00010246
[ 172.845367] RAX: 0000000000000001 RBX: ffffffffbce31e58 RCX: 0000000080200018
[ 172.852501] RDX: 0000000080200019 RSI: ffffdfbd44161640 RDI: 0000000000000000
[ 172.859632] RBP: ffffffffbce31e58 R08: 0000000000000000 R09: 0000000080200018
[ 172.866764] R10: ffff9e8f05859e80 R11: ffff9e9443b1bd98 R12: ffff9ea057f1d000
[ 172.873899] R13: ffffffffbce31e60 R14: dead000000000122 R15: dead000000000100
[ 172.881033] FS: 00007f442c53c800(0000) GS:ffff9e9429000000(0000) knlGS:0000000000000000
[ 172.889117] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 172.894861] CR2: 0000000000000008 CR3: 000000010ba96006 CR4: 00000000007706e0
[ 172.901997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 172.909127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 172.916261] PKRU: 55555554
[ 172.918974] Call Trace:
[ 172.921427] <TASK>
[ 172.923533] rdt_kill_sb+0x29b/0x350
[ 172.927112] deactivate_locked_super+0x2f/0xa0
[ 172.931559] cleanup_mnt+0xee/0x180
[ 172.935051] task_work_run+0x5c/0x90
[ 172.938629] exit_to_user_mode_prepare+0x229/0x230
[ 172.943424] syscall_exit_to_user_mode+0x18/0x40
[ 172.948043] do_syscall_64+0x46/0x80
[ 172.951623] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 172.956675] RIP: 0033:0x7f442c75126b
[ 172.960271] Code: cb 1b 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3
0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 91 1b 0e 00
f7 d8
[ 172.979017] RSP: 002b:00007ffc82d66fe8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
[ 172.986584] RAX: 0000000000000000 RBX: 000055bd4cc37090 RCX: 00007f442c75126b
[ 172.993715] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000055bd4cc3b950
[ 173.000849] RBP: 000055bd4cc371a8 R08: 0000000000000000 R09: 0000000000000073
[ 173.007980] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
[ 173.015115] R13: 000055bd4cc3b950 R14: 000055bd4cc372c0 R15: 000055bd4cc37090
[ 173.022249] </TASK>
[ 173.024440] Modules linked in: rfkill intel_rapl_msr
intel_rapl_common isst_if_common irdma skx_edac nfit libnvdimm ice
x86_pkg_temp_thermal intel_powerclamp coretemp ib_uverbs iTCO_wdt
intel_pmc_bxt ib_core iTCO_vendor_support kvm_intel ipmi_ssif kvm
irqbypass rapl acpi_ipmi intel_cstate i40e joydev mei_me ioatdma
i2c_i801 intel_uncore lpc_ich i2c_smbus mei intel_pch_thermal dca
ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter fuse
zram xfs crct10dif_pclmul ast crc32_pclmul crc32c_intel
drm_vram_helper drm_ttm_helper ttm wmi ghash_clmulni_intel
[ 173.073900] CR2: 0000000000000008
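
As a cross-check of the log above (a rough sketch, not a diagnosis), the
marked byte in the "Code:" line can be decoded by hand. The small Python
script below is only an illustration: it handles just the one encoding
that appears here (REX.W 8B /r with a register base and an 8-bit
displacement), and the helper names are made up for this mail. It
suggests the faulting instruction is a load from 0x8(%rdi); with
RDI = 0 that gives address 0x8, which matches CR2. If RDI still holds
the first argument at that point, this would mean kernfs_remove() was
called with a NULL pointer from rdt_kill_sb(), but that part is an
assumption.

```python
# Values copied from the dmesg output in this report.
CODE_LINE = (
    "e8 be e7 2c 00 48 89 df e8 b6 8c f0 ff 48 c7 c3 f4 ff ff ff "
    "48 89 d8 5b 5d 41 5c 41 5d 41 5e c3 cc 66 90 0f 1f 44 00 00 "
    "55 53 <48> 8b 47 08 48 89 fb 48 85 c0 48 0f 44 c7 48 8b 68 50 "
    "48 83 c5 60"
)
RDI = 0x0000000000000000
CR2 = 0x0000000000000008

# Register numbering for the low eight 64-bit registers (no REX.R/REX.B).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line, count=4):
    """Return `count` bytes starting at the byte marked with <..>."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:start + count]]


def decode_mov_disp8(insn):
    """Decode only 'REX.W 8B /r' with mod=01 (base register + disp8)."""
    rex, opcode, modrm, disp8 = insn
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only base+disp8 without SIB is handled")
    if disp8 >= 0x80:  # disp8 is a signed byte
        disp8 -= 0x100
    return reg, rm, disp8


insn = faulting_bytes(CODE_LINE)
dst, base, disp = decode_mov_disp8(insn)
print(f"mov {disp:#x}(%{REGS[base]}),%{REGS[dst]}")
print("fault address matches CR2:", RDI + disp == CR2)
```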
--
-Jirka