Message-ID: <ab54f94827d200ac8a05b4ee180895b0cbd55014.camel@kernel.crashing.org>
Date: Sat, 26 Oct 2024 10:26:15 +1100
From: Benjamin Herrenschmidt <benh@...nel.crashing.org>
To: Kuniyuki Iwashima <kuniyu@...zon.com>, x86@...nel.org,
	linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: Tony Luck <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"H. Peter Anvin" <hpa@...or.com>
Subject: Re: WARNING in lmce_supported() during reboot.

On Fri, 2024-10-25 at 16:13 -0700, Kuniyuki Iwashima wrote:
> Hello x86 maintainers,
>
> We have seen the splat below a few times when just rebooting hosts.
>
> It happens rarely and seems to be timing-related, so we don't have a
> reproducer.
>
> Our kernel source in the splat is here,
> https://github.com/amazonlinux/linux/tree/kernel-6.1.61-85.141.amzn2023
>
> and the triggered WARN_ON_ONCE() in lmce_supported() is here.
> https://github.com/amazonlinux/linux/blob/kernel-6.1.61-85.141.amzn2023/arch/x86/kernel/cpu/mce/intel.c#L124

(switching to my lkml/spam friendly email)

I also hit it with 6.1.112-122.189.amzn2023.x86_64.

Cheers,
Ben.

> Do you have any hints?
>
> Thanks in advance.
>
>
> ACPI: PM: Preparing to enter system sleep state S5
> reboot: Restarting system
> reboot: machine restart
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 0 at arch/x86/kernel/cpu/mce/intel.c:124
> lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124
> arch/x86/kernel/cpu/mce/intel.c:99)
> Modules linked in: ib_core binfmt_misc ext4 crc16 mbcache jbd2 sunrpc
> mousedev atkbd psmouse ghash_clmulni_intel vivaldi_fmap libps2
> aesni_intel crypto_simd cryptd i8042 serio ena button sch_fq_codel
> dm_mod fuse configfs dax loop dmi_sysfs simpledrm drm_shmem_helper
> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect
> sysimgblt fb_sys_fops cfbcopyarea drm i2c_core
> drm_panel_orientation_quirks backlight fb crc32_pclmul crc32c_intel
> fbdev efivarfs
> Hardware name: Amazon EC2 c6i.4xlarge/, BIOS 1.0 10/16/2017
> RIP: 0010:lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124
> arch/x86/kernel/cpu/mce/intel.c:99)
> Code: 81 fb 00 00 00 09 75 da b9 3a 00 00 00 0f 32 48 c1 e2 20 48 09
> c2 48 89 d3 66 90 48 89 d8 48 c1 e8 14 83 e0 01 83 e3 01 75 ba <0f>
> 0b 31 c0 eb b4 31 d2 48 89 de bf 3a 00 00 00 e8 6b e6 57 00 eb
> All code
> ========
> 0: 81 fb 00 00 00 09 cmp $0x9000000,%ebx
> 6: 75 da jne 0xffffffffffffffe2
> 8: b9 3a 00 00 00 mov $0x3a,%ecx
> d: 0f 32 rdmsr
> f: 48 c1 e2 20 shl $0x20,%rdx
> 13: 48 09 c2 or %rax,%rdx
> 16: 48 89 d3 mov %rdx,%rbx
> 19: 66 90 xchg %ax,%ax
> 1b: 48 89 d8 mov %rbx,%rax
> 1e: 48 c1 e8 14 shr $0x14,%rax
> 22: 83 e0 01 and $0x1,%eax
> 25: 83 e3 01 and $0x1,%ebx
> 28: 75 ba jne 0xffffffffffffffe4
> 2a:* 0f 0b ud2 <-- trapping instruction
> 2c: 31 c0 xor %eax,%eax
> 2e: eb b4 jmp 0xffffffffffffffe4
> 30: 31 d2 xor %edx,%edx
> 32: 48 89 de mov %rbx,%rsi
> 35: bf 3a 00 00 00 mov $0x3a,%edi
> 3a: e8 6b e6 57 00 call 0x57e6aa
> 3f: eb .byte 0xeb
>
> Code starting with the faulting instruction
> ===========================================
> 0: 0f 0b ud2
> 2: 31 c0 xor %eax,%eax
> 4: eb b4 jmp 0xffffffffffffffba
> 6: 31 d2 xor %edx,%edx
> 8: 48 89 de mov %rbx,%rsi
> b: bf 3a 00 00 00 mov $0x3a,%edi
> 10: e8 6b e6 57 00 call 0x57e680
> 15: eb .byte 0xeb
> RSP: 0018:ffffa18f00154fb8 EFLAGS: 00010046
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000003a
> RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff965cfe2599c0
> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: ffffa18f00154ff8 R12: 0000000000000001
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff965cfe240000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f8485dfba30 CR3: 0000000389a10003 CR4: 00000000007706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <IRQ>
> ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
> ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
> ? mce_intel_feature_clear (arch/x86/kernel/cpu/mce/intel.c:465
> arch/x86/kernel/cpu/mce/intel.c:502)
> ? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124
> arch/x86/kernel/cpu/mce/intel.c:99)
> ? __warn (kernel/panic.c:672)
> ? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124
> arch/x86/kernel/cpu/mce/intel.c:99)
> ? report_bug (lib/bug.c:201 lib/bug.c:219)
> ? handle_bug (arch/x86/kernel/traps.c:324)
> ? exc_invalid_op (arch/x86/kernel/traps.c:345 (discriminator 1))
> ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> ? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124
> arch/x86/kernel/cpu/mce/intel.c:99)
> ? clear_local_APIC (./arch/x86/include/asm/apic.h:393
> arch/x86/kernel/apic/apic.c:1192)
> mce_intel_feature_clear (arch/x86/kernel/cpu/mce/intel.c:465
> arch/x86/kernel/cpu/mce/intel.c:502)
> stop_this_cpu (arch/x86/kernel/process.c:780)
> __sysvec_reboot (arch/x86/kernel/smp.c:140)
> sysvec_reboot (arch/x86/kernel/smp.c:136 (discriminator 14))
> </IRQ>
> <TASK>
> asm_sysvec_reboot (./arch/x86/include/asm/idtentry.h:656)
> RIP: 0010:acpi_idle_do_entry (./arch/x86/include/asm/irqflags.h:40
> ./arch/x86/include/asm/irqflags.h:75
> drivers/acpi/processor_idle.c:113 drivers/acpi/processor_idle.c:572)
> Code: 75 08 48 8b 15 b1 81 df 02 ed c3 cc cc cc cc 65 48 8b 04 25 00
> ff 01 00 48 8b 00 a8 08 75 eb 66 90 0f 00 2d 58 c8 6a 00 fb f4 <fa>
> c3 cc cc cc cc e9 01 fc ff ff 90 0f 1f 44 00 00 41 56 41 55 41
> All code
> ========
> 0: 75 08 jne 0xa
> 2: 48 8b 15 b1 81 df 02 mov 0x2df81b1(%rip),%rdx #
> 0x2df81ba
> 9: ed in (%dx),%eax
> a: c3 ret
> b: cc int3
> c: cc int3
> d: cc int3
> e: cc int3
> f: 65 48 8b 04 25 00 ff mov %gs:0x1ff00,%rax
> 16: 01 00
> 18: 48 8b 00 mov (%rax),%rax
> 1b: a8 08 test $0x8,%al
> 1d: 75 eb jne 0xa
> 1f: 66 90 xchg %ax,%ax
> 21: 0f 00 2d 58 c8 6a 00 verw 0x6ac858(%rip) #
> 0x6ac880
> 28: fb sti
> 29: f4 hlt
> 2a:* fa cli <-- trapping instruction
> 2b: c3 ret
> 2c: cc int3
> 2d: cc int3
> 2e: cc int3
> 2f: cc int3
> 30: e9 01 fc ff ff jmp 0xfffffffffffffc36
> 35: 90 nop
> 36: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 3b: 41 56 push %r14
> 3d: 41 55 push %r13
> 3f: 41 rex.B
>
> Code starting with the faulting instruction
> ===========================================
> 0: fa cli
> 1: c3 ret
> 2: cc int3
> 3: cc int3
> 4: cc int3
> 5: cc int3
> 6: e9 01 fc ff ff jmp 0xfffffffffffffc0c
> b: 90 nop
> c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 11: 41 56 push %r14
> 13: 41 55 push %r13
> 15: 41 rex.B
> RSP: 0018:ffffa18f000afe70 EFLAGS: 00000246
> RAX: 0000000000004000 RBX: ffff965603d92400 RCX: 4000000000000000
> RDX: ffff965cfe240000 RSI: ffff965601478800 RDI: ffff965601478864
> RBP: 0000000000000001 R08: ffffffffb62182c0 R09: 0000000000000000
> R10: 0000000000002703 R11: 000000000001993d R12: 0000000000000001
> R13: ffffffffb6218340 R14: 0000000000000001 R15: 0000000000000000
> acpi_idle_enter (drivers/acpi/processor_idle.c:711 (discriminator 3))
> cpuidle_enter_state (drivers/cpuidle/cpuidle.c:239)
> cpuidle_enter (drivers/cpuidle/cpuidle.c:358)
> cpuidle_idle_call (kernel/sched/idle.c:240)
> do_idle (kernel/sched/idle.c:305)
> cpu_startup_entry (kernel/sched/idle.c:400 (discriminator 1))
> start_secondary (arch/x86/kernel/smpboot.c:215
> arch/x86/kernel/smpboot.c:249)
> secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358)
> </TASK>
> ---[ end trace 0000000000000000 ]---
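
For readers decoding the splat: the quoted disassembly reads MSR 0x3a with
rdmsr, then tests bit 20 (shr $0x14 / and $0x1) and bit 0 (and $0x1,%ebx) of
the result, and the ud2 is reached when bit 0 is clear. The sketch below
mirrors that check in plain userspace C. The macro names follow the kernel's
FEAT_CTL_LOCKED and FEAT_CTL_LMCE_ENABLED definitions; the two helper
functions are hypothetical and exist only for illustration. This is not the
kernel's lmce_supported() itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of MSR 0x3a as tested by the quoted disassembly. */
#define FEAT_CTL_LOCKED        (1ULL << 0)   /* and $0x1,%ebx */
#define FEAT_CTL_LMCE_ENABLED  (1ULL << 20)  /* shr $0x14,%rax; and $0x1,%eax */

/* Hypothetical helper: true when the lock bit is clear, i.e. the
 * condition under which the warning in the splat is reached. */
static bool feat_ctl_would_warn(uint64_t feat_ctl)
{
    return (feat_ctl & FEAT_CTL_LOCKED) == 0;
}

/* Hypothetical helper: LMCE counts as enabled only when the MSR is
 * locked and bit 20 is set; an unlocked MSR yields false. */
static bool feat_ctl_lmce_enabled(uint64_t feat_ctl)
{
    if (feat_ctl_would_warn(feat_ctl))
        return false;
    return (feat_ctl & FEAT_CTL_LMCE_ENABLED) != 0;
}
```

In the register dump, RDX (which the listed instructions do not modify after
the "or" that combines the two halves) is 0, so the MSR appears to have read
back as all zeroes. For that value feat_ctl_would_warn() returns true, which
is consistent with the warning firing on an MSR that looks unlocked at that
point in the reboot path.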