lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+QrTELaLFRGn1ynG5dG+KB_40aPA31hU5QgLn7ikh2Zbk3Hpg@mail.gmail.com>
Date: Wed, 6 Aug 2025 15:19:02 -0500
From: Fred Lawler <fred@...udflare.com>
To: Corey Minyard <corey@...yard.net>
Cc: openipmi-developer@...ts.sourceforge.net, linux-kernel@...r.kernel.org, 
	kernel-team@...udflare.com
Subject: Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP

+ CC: Corey Minyard <corey@...yard.net>

Misspelled the email address.

On Wed, Aug 6, 2025 at 3:14 PM Frederick Lawler <fred@...udflare.com> wrote:
>
> Hi Corey,
>
> In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
> watchdog event, especially so while querying the BMC for basic device
> information via sysfs.
>
> I havn't been able to create a consistent reproducer yet, but I suspect
> that these occur during high traffic, BMC is resetting, and reading
> from the sysfs files in parallel. We're also using KCS to interface
> with the BMC.
>
> I can consistently reproduce hung tasks trivially with the following,
> during a BMC reset:
>
> while true; do cat aux_firmware_revision &>/dev/null; done &
>
> I tried also tried to load the CPUs with stress-ng, but the best I can do
> are the hung tasks.
>
> I identified that sni_send()[1] could be locked behind the
> spin_lock_irqsave() and within the KCS send handler, there's another irq
> save lock. I suspect this is where we're getting hung up. Below is a
> sample stack trace + log output.
>
> I'm happy to provide traces and additional information, let me know.
>
> Links:
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
>
> [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> [  904.196130] [     C33] Modules linked in:
> [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> [  904.196131] [     C97] Dazed and confused, but trying to continue
> [  904.196131] [     C33]  nft_fwd_netdev
> [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> [  904.196133] [      C6] Dazed and confused, but trying to continue
> [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> [  904.196134] [     C33]  nf_dup_netdev
> [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> [  904.196135] [    C101] Dazed and confused, but trying to continue
> [  904.196137] [     C99] Dazed and confused, but trying to continue
> [  904.196137] [     C33]  xfrm_interface
> [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> [  904.196140] [    C102] Dazed and confused, but trying to continue
> [  904.196140] [     C33]  xfrm6_tunnel
> [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> [  904.196142] [     C35] Dazed and confused, but trying to continue
> [  904.196143] [     C69] Dazed and confused, but trying to continue
> [  904.196143] [     C33]  nft_numgen
> [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> [  904.196146] [    C123] Dazed and confused, but trying to continue
> [  904.196147] [    C121] Dazed and confused, but trying to continue
> [  904.196148] [     C58] Dazed and confused, but trying to continue
> [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> [  904.196320] [     C33] PKRU: 55555554
> [  904.196322] [     C33] Call Trace:
> [  904.196324] [     C33]  <TASK>
> [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> [  904.196352] [     C33]  do_idle+0x176/0x1c0
> [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> [  904.196362] [     C33]  start_secondary+0xf7/0x100
> [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> [  904.196374] [     C33]  </TASK>
> [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> [  904.196385] [     C33] Call Trace:
> [  904.196387] [     C33]  <NMI>
> [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> [  904.196394] [     C33]  panic+0x106/0x2c4
> [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> [  904.196483] [     C33]  exc_nmi+0x103/0x180
> [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> [  904.196510] [     C33]  ? io_idle+0x3/0x30
> [  904.196515] [     C33]  ? io_idle+0x3/0x30
> [  904.196519] [     C33]  </NMI>
> [  904.196520] [     C33]  <TASK>
> [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> [  904.196539] [     C33]  do_idle+0x176/0x1c0
> [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> [  904.196548] [     C33]  start_secondary+0xf7/0x100
> [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> [  904.196559] [     C33]  </TASK>
>
> Best, Fred

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ