Message-ID: <alpine.LRH.2.21.1911141316340.16328@praktifix.dwd.de>
Date:   Thu, 14 Nov 2019 14:11:58 +0000 (UTC)
From:   Holger Kiehl <Holger.Kiehl@....de>
To:     tariqt@...lanox.com, netdev@...r.kernel.org
Subject: mlx4 kernel oops when rebooting system
Hello,
I have two systems, each with two Mellanox ConnectX-3 Pro cards, and
every time the system is rebooted it gets a kernel oops. If I remember
correctly, this started when I upgraded to kernel 4.19.x; with 4.14.x
I did not see it. I tried newer kernels, but that did not help. Here is
an oops with kernel 5.1.21:
 {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
 {1}[Hardware Error]: event severity: fatal
 {1}[Hardware Error]:  Error 0, type: fatal
 {1}[Hardware Error]:   section_type: PCIe error
 {1}[Hardware Error]:   port_type: 4, root port
 {1}[Hardware Error]:   version: 1.16
 {1}[Hardware Error]:   command: 0x6010, status: 0x0143
 {1}[Hardware Error]:   device_id: 0000:80:03.0
 {1}[Hardware Error]:   slot: 0
 {1}[Hardware Error]:   secondary_bus: 0x88
 {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f08
 {1}[Hardware Error]:   class_code: 040600
 {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
 {1}[Hardware Error]:  Error 1, type: fatal
 {1}[Hardware Error]:   section_type: PCIe error
 {1}[Hardware Error]:   port_type: 4, root port
 {1}[Hardware Error]:   version: 1.16
 {1}[Hardware Error]:   command: 0x6010, status: 0x0143
 {1}[Hardware Error]:   device_id: 0000:80:03.0
 {1}[Hardware Error]:   slot: 0
 {1}[Hardware Error]:   secondary_bus: 0x88
 {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f08
 {1}[Hardware Error]:   class_code: 040600
 {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
 Kernel panic - not syncing: Fatal hardware error!
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.21 #1
 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 03/25/2019
 Call Trace:
  <NMI>
  dump_stack+0x46/0x5b
  panic+0xf7/0x291
  __ghes_panic+0x63/0x6f
  ghes_notify_nmi+0x1fd/0x290
  nmi_handle+0x69/0x110
  do_nmi+0x117/0x3a0
  end_repeat_nmi+0x16/0x50
 RIP: 0010:intel_idle+0x7d/0x120
 Code: 04 25 00 4d 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 00 00 00 0f 00 2d 9e 6b 48 00 b9 01 00 00 00 48 89 e8 0f 01 c9 <65> 48 8b 04 25 00 4d 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
 RSP: 0018:ffffffff88003e40 EFLAGS: 00000046
 RAX: 0000000000000020 RBX: 0000000000000004 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: ffffffff88097780 RDI: 0000000000000000
 RBP: 0000000000000020 R08: 0000000000000002 R09: ffe813d0035074b0
 R10: 0000000000000018 R11: 071c71c71c71c71c R12: 0000000000000004
 R13: ffffffff88097918 R14: ffffffff88015780 R15: 0000000000000000
  ? intel_idle+0x7d/0x120
  ? intel_idle+0x7d/0x120
  </NMI>
  cpuidle_enter_state+0x7c/0x3a0
  do_idle+0x1ad/0x230
  cpu_startup_entry+0x14/0x20
  start_kernel+0x502/0x522
  ? set_init_arg+0x50/0x50
  secondary_startup_64+0xa4/0xb0
 Kernel Offset: 0x6000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
I am not sure if it is related, but on
    https://docs.mellanox.com/display/MLNXOFEDv471001/Bug+Fixes
I read under 1843020: "Server reboot may result in a system crash."
So could it be that Mellanox has already fixed the problem, but the fix
was not backported to the kernel.org driver?
The problem with this panic is that the system can no longer be powered
off safely, because it then reboots.
Please advise what I can do, or what other information I can provide, to
help fix the problem.
Regards,
Holger
PS: Please CC me, since I am not subscribed to netdev@...r.kernel.org