lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <alpine.LRH.2.21.1911141316340.16328@praktifix.dwd.de>
Date:   Thu, 14 Nov 2019 14:11:58 +0000 (UTC)
From:   Holger Kiehl <Holger.Kiehl@....de>
To:     tariqt@...lanox.com, netdev@...r.kernel.org
Subject: mlx4 kernel oops when rebooting system

Hello,

I have two systems each having two Mellanox ConnectX-3 Pro cards and
every time the system is rebooted it gets a kernel oops. If I remember
it correctly, this started when I upgraded to kernel 4.19.x. Before with
4.14.x I did not get this. I tried with newer kernels, but that did
not help. Here an oops with kernel 5.1.21:

 {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
 {1}[Hardware Error]: event severity: fatal
 {1}[Hardware Error]:  Error 0, type: fatal
 {1}[Hardware Error]:   section_type: PCIe error
 {1}[Hardware Error]:   port_type: 4, root port
 {1}[Hardware Error]:   version: 1.16
 {1}[Hardware Error]:   command: 0x6010, status: 0x0143
 {1}[Hardware Error]:   device_id: 0000:80:03.0
 {1}[Hardware Error]:   slot: 0
 {1}[Hardware Error]:   secondary_bus: 0x88
 {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f08
 {1}[Hardware Error]:   class_code: 040600
 {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
 {1}[Hardware Error]:  Error 1, type: fatal
 {1}[Hardware Error]:   section_type: PCIe error
 {1}[Hardware Error]:   port_type: 4, root port
 {1}[Hardware Error]:   version: 1.16
 {1}[Hardware Error]:   command: 0x6010, status: 0x0143
 {1}[Hardware Error]:   device_id: 0000:80:03.0
 {1}[Hardware Error]:   slot: 0
 {1}[Hardware Error]:   secondary_bus: 0x88
 {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f08
 {1}[Hardware Error]:   class_code: 040600
 {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
 Kernel panic - not syncing: Fatal hardware error!
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.21 #1
 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 03/25/2019
 Call Trace:
  <NMI>
  dump_stack+0x46/0x5b
  panic+0xf7/0x291
  __ghes_panic+0x63/0x6f
  ghes_notify_nmi+0x1fd/0x290
  nmi_handle+0x69/0x110
  do_nmi+0x117/0x3a0
  end_repeat_nmi+0x16/0x50
 RIP: 0010:intel_idle+0x7d/0x120
 Code: 04 25 00 4d 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 00 00 00 0f 00 2d 9e 6b 48 00 b9 01 00 00 00 48 89 e8 0f 01 c9 <65> 48 8b 04 25 00 4d 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
 RSP: 0018:ffffffff88003e40 EFLAGS: 00000046
 RAX: 0000000000000020 RBX: 0000000000000004 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: ffffffff88097780 RDI: 0000000000000000
 RBP: 0000000000000020 R08: 0000000000000002 R09: ffe813d0035074b0
 R10: 0000000000000018 R11: 071c71c71c71c71c R12: 0000000000000004
 R13: ffffffff88097918 R14: ffffffff88015780 R15: 0000000000000000
  ? intel_idle+0x7d/0x120
  ? intel_idle+0x7d/0x120
  </NMI>
  cpuidle_enter_state+0x7c/0x3a0
  do_idle+0x1ad/0x230
  cpu_startup_entry+0x14/0x20
  start_kernel+0x502/0x522
  ? set_init_arg+0x50/0x50
  secondary_startup_64+0xa4/0xb0
 Kernel Offset: 0x6000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Not sure if it is related but on

    https://docs.mellanox.com/display/MLNXOFEDv471001/Bug+Fixes

I read under 1843020 "Server reboot may result in a system crash.".
So could it be that Mellanox already fixed the problem but this did
not get back ported to the kernel.org driver?

Problem with this panic is that one can no longer poweroff the system
safely because it then reboots.

Please advice what I can do or provide other information to help fix
the problem.

Regards,
Holger


PS: Please CC me since I am not on netdev@...r.kernel.org

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ