lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20200913204623.GA10015@pcnci.linuxbox.cz>
Date:   Sun, 13 Sep 2020 22:46:23 +0200
From:   Nikola Ciprich <nikola.ciprich@...uxbox.cz>
To:     netdev@...r.kernel.org
Cc:     elic@...dia.com, nik@...uxbox.cz,
        Stanislav Schattke <schattke@...uxbox.cz>
Subject: 5.4.55 mlx5x - panic on bond link loss

Hi,

just after updating one of our clusters to 5.4.55 and reconnecting to another stack
of switches, the box panicked.. here's what i digged out from pstore:

<6>[ 1056.250637] bond0: (slave eth3): Enslaving as a backup interface with an up link
<4>[ 1057.559331] ------------[ cut here ]------------
<2>[ 1057.564307] kernel BUG at mm/slub.c:3995!
<4>[ 1057.568757] invalid opcode: 0000 [#1] SMP NOPTI
<4>[ 1057.573633] CPU: 28 PID: 21078 Comm: kworker/u64:1 Tainted: G            E     5.4.55lb7.01 #1
<4>[ 1057.582992] Hardware name: Supermicro Super Server/X11DDW-NT, BIOS 3.1 04/30/2019
<4>[ 1057.591083] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
<4>[ 1057.597177] RIP: 0010:kfree+0x1ea/0x200
<4>[ 1057.601428] Code: d3 e0 49 8b 0c 24 48 63 d0 48 c1 e9 36 48 8b 3c cd 60 f5 1c 82 e8 d6 27 fa ff 89 de 4c 89 e7 5b 5d 41 5c 41 5d e9 f6 6a fd ff <0f> 0b 48 83 e8 01 e9 7c fe ff ff 4c 8d 60 ff e9 63 fe ff ff 66 90
<4>[ 1057.621052] RSP: 0018:ffffc90030c27c70 EFLAGS: 00010246
<4>[ 1057.626621] RAX: ffffea02f800c4c8 RBX: 0000000000000000 RCX: ffff893e523f1cd9
<4>[ 1057.634109] RDX: 0000777f80000000 RSI: ffff893e9a3f1f28 RDI: ffff893e00313332
<4>[ 1057.641599] RBP: ffff893e523f1bb0 R08: ffff893e523f1cd8 R09: ffff893e523ef018
<4>[ 1057.649157] R10: ffff893e523ef568 R11: ffffc90030c27c68 R12: ffffea02f800c4c0
<4>[ 1057.656640] R13: 0000000000000000 R14: ffff88dec0910430 R15: ffff893e92e604a0
<4>[ 1057.664121] FS:  0000000000000000(0000) GS:ffff893ebfb00000(0000) knlGS:0000000000000000
<4>[ 1057.672798] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1057.678895] CR2: 00007fca074d2000 CR3: 000000be5cb98006 CR4: 00000000007606e0
<4>[ 1057.686375] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1057.693855] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 1057.701340] PKRU: 55555554
<4>[ 1057.704389] Call Trace:
<4>[ 1057.707185]  kernfs_put+0x71/0x180
<4>[ 1057.710935]  __kernfs_remove+0xf7/0x1f0
<4>[ 1057.715117]  ? kernfs_name_hash+0x12/0x80
<4>[ 1057.719475]  kernfs_remove_by_name_ns+0x3e/0x80
<4>[ 1057.724422]  remove_files.isra.1+0x31/0x70
<4>[ 1057.728859]  sysfs_remove_group+0x3d/0x80
<4>[ 1057.733221]  ib_free_port_attrs+0x85/0x170 [ib_core]
<4>[ 1057.738531]  __ib_unregister_device+0x45/0x90 [ib_core]
<4>[ 1057.744104]  ib_unregister_device+0x21/0x30 [ib_core]
<4>[ 1057.749506]  __mlx5_ib_remove+0x31/0x50 [mlx5_ib]
<4>[ 1057.754560]  mlx5_remove_device+0xbf/0xd0 [mlx5_core]
<4>[ 1057.759967]  mlx5_do_bond+0x14d/0x180 [mlx5_core]
<4>[ 1057.765018]  mlx5_do_bond_work+0x1b/0x40 [mlx5_core]
<4>[ 1057.770326]  process_one_work+0x171/0x380
<4>[ 1057.774680]  worker_thread+0x49/0x3f0
<4>[ 1057.778687]  kthread+0xf8/0x130
<4>[ 1057.782165]  ? max_active_store+0x80/0x80
<4>[ 1057.786523]  ? kthread_bind+0x10/0x10
<4>[ 1057.790533]  ret_from_fork+0x1f/0x40
<4>[ 1057.794449] Modules linked in: rbd(E) ceph(E) libceph(E) dns_resolver(E) netconsole(E) bonding(E) openvswitch(E) nf_conncount(E) nf_nat(E) nsh(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) nf_defrag_ipv4(E) ib_isert(E) iscsi_target_mod(E) ib_srpt(E) target_core_mod(E) ib_srp(E) scsi_transport_srp(E) i40iw(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_iser(E) rdma_cm(E) ib_umad(E) iw_cm(E) crc32_pclmul(E) ib_ipoib(E) libiscsi(E) aesni_intel(E) ib_cm(E) glue_helper(E) scsi_transport_iscsi(E) crypto_simd(E) mlx5_ib(E) ghash_clmulni_intel(E) cryptd(E) coretemp(E) iTCO_wdt(E) iTCO_vendor_support(E) ib_uverbs(E) ib_core(E) crct10dif_pclmul(E) intel_powerclamp(E) x86_pkg_temp_thermal(E) ipmi_si(E) i2c_i801(E) mei_me(E) wmi(E) i2c_core(E) mei(E) ipmi_devintf(E) lpc_ich(E) pcspkr(E) sg(E) mfd_core(E) ioatdma(E) ipmi_msghandler(E) dca(E) acpi_power_meter(E) acpi_pad(E) vhost_net(E) tun(E) vhost(E) tap(E) kvm_intel(E) kvm(E) irqbypass(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) raid1(E) sd_mod(E)
<4>[ 1057.794473]  mlx5_core(E) ahci(E) mlxfw(E) crc32c_intel(E) pci_hyperv_intf(E) libahci(E) nvme(E) i40e(E) xhci_pci(E) nvme_core(E) xhci_hcd(E) libata(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) dax(E)
<4>[ 1057.904356] ---[ end trace 5ba0d92f4b61983a ]---
<4>[ 1057.967400] RIP: 0010:kfree+0x1ea/0x200
<4>[ 1057.971584] Code: d3 e0 49 8b 0c 24 48 63 d0 48 c1 e9 36 48 8b 3c cd 60 f5 1c 82 e8 d6 27 fa ff 89 de 4c 89 e7 5b 5d 41 5c 41 5d e9 f6 6a fd ff <0f> 0b 48 83 e8 01 e9 7c fe ff ff 4c 8d 60 ff e9 63 fe ff ff 66 90
<4>[ 1057.991215] RSP: 0018:ffffc90030c27c70 EFLAGS: 00010246
<4>[ 1057.996792] RAX: ffffea02f800c4c8 RBX: 0000000000000000 RCX: ffff893e523f1cd9
<4>[ 1058.004274] RDX: 0000777f80000000 RSI: ffff893e9a3f1f28 RDI: ffff893e00313332
<4>[ 1058.011764] RBP: ffff893e523f1bb0 R08: ffff893e523f1cd8 R09: ffff893e523ef018
<4>[ 1058.019250] R10: ffff893e523ef568 R11: ffffc90030c27c68 R12: ffffea02f800c4c0
<4>[ 1058.026740] R13: 0000000000000000 R14: ffff88dec0910430 R15: ffff893e92e604a0
<4>[ 1058.034229] FS:  0000000000000000(0000) GS:ffff893ebfb00000(0000) knlGS:0000000000000000
<4>[ 1058.042914] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1058.049009] CR2: 00007fca074d2000 CR3: 000000be5cb98006 CR4: 00000000007606e0
<4>[ 1058.056491] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1058.063973] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 1058.071459] PKRU: 55555554
<0>[ 1058.074510] Kernel panic - not syncing: Fatal exception
<0>[ 1058.080140] Kernel Offset: disabled

anyone else observed similar problem? If not, we'll try to reproduce this on the lab..

BR

nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@...uxbox.cz
-------------------------------------

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ