Message-ID: <b8f845d8-9642-2008-4d7d-819ed6e44d9f@linux.ibm.com>
Date: Mon, 10 Oct 2022 11:59:07 +0530
From: Ganesh <ganeshgr@...ux.ibm.com>
To: saeedm@...dia.com, tariqt@...dia.com
Cc: netdev@...r.kernel.org, oohall@...il.com
Subject: mlx cards failing to recover from EEH errors
Hi Saeed/Tariq,
On PPC, EEH error recovery has not been working on mlx cards since v5.17-rc1.
Any idea what has gone wrong?
Log:
[ 649.335564] mlx5_core 0041:01:00.0: poll_health:817:(pid 0): Fatal error 1 detected
[ 649.335588] mlx5_core 0041:01:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
[ 649.336397] EEH: Recovering PHB#41-PE#10000
[ 649.336410] EEH: PE location: N/A, PHB location: N/A
[ 649.336416] EEH: Frozen PHB#41-PE#10000 detected
[ 649.336422] EEH: Call Trace:
[ 649.336426] EEH: [c000000000053b78] __eeh_send_failure_event+0x78/0x150
[ 649.336597] EEH: [c00000000004ccf8] eeh_dev_check_failure+0x318/0x6b0
[ 649.336607] EEH: [c00000000004d158] eeh_check_failure+0xc8/0x100
[ 649.336615] EEH: [c0000000007e1454] ioread32be+0x114/0x180
[ 649.336623] EEH: [c008000002eb4308] mlx5_health_check_fatal_sensors+0x30/0x190 [mlx5_core]
[ 649.336684] EEH: [c008000002eb4c10] poll_health+0x58/0x270 [mlx5_core]
[ 649.336740] EEH: [c000000000237098] call_timer_fn+0x58/0x1f0
[ 649.336748] EEH: [c000000000238490] run_timer_softirq+0x360/0x7f0
[ 649.336756] EEH: [c000000000e971f8] __do_softirq+0x158/0x3d0
[ 649.336766] EEH: [c0000000000167c0] do_softirq_own_stack+0x40/0x60
[ 649.336775] EEH: [c00000000014dd98] irq_exit+0x178/0x1c0
[ 649.336782] EEH: [c0000000000273f4] timer_interrupt+0x194/0x410
[ 649.336791] EEH: [c000000000009964] decrementer_common_virt+0x214/0x220
[ 649.336805] EEH: [0000009721f89180] 0x9721f89180
[ 649.336813] EEH: [c000000000b349bc] dedicated_cede_loop+0x9c/0x1b0
[ 649.336825] EEH: [c000000000b31114] cpuidle_enter_state+0x364/0x570
[ 649.336835] EEH: [c000000000b313c0] cpuidle_enter+0x50/0x70
[ 649.336845] EEH: [c0000000001bac68] do_idle+0x3a8/0x420
[ 649.336855] EEH: [c0000000001baf18] cpu_startup_entry+0x38/0x40
[ 649.336869] EEH: [c0000000000125bc] rest_init+0xec/0xf0
[ 649.336878] EEH: [c000000002004274] arch_post_acpi_subsys_init+0x0/0xc
[ 649.336889] EEH: [c000000002004cc4] start_kernel+0xa14/0xa64
[ 649.336900] EEH: [c00000000000d780] start_here_common+0x1c/0x9c
[ 649.336909] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 649.336921] EEH: Notify device drivers to shutdown
[ 649.336929] EEH: Beginning: 'error_detected(IO frozen)'
[ 649.336937] PCI 0041:01:00.0#10000: EEH: Invoking mlx5_core->error_detected(IO frozen)
[ 649.336943] mlx5_core 0041:01:00.0: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
[ 649.336967] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:243:(pid 198): start
[ 649.339914] mlx5_core 0041:01:00.0: mlx5_crdump_fill:34:(pid 206): failed to read full dump, read 32 out of 1179644
[ 649.340586] mlx5_core 0041:01:00.0: mlx5_health_try_recover:335:(pid 206): handling bad device here
[ 650.695123] mlx5_core 0041:01:00.1: poll_health:817:(pid 0): Fatal error 1 detected
[ 650.695140] mlx5_core 0041:01:00.1: print_health_info:423:(pid 0): PCI slot is unavailable
[ 650.698918] mlx5_core 0041:01:00.1: mlx5_crdump_fill:34:(pid 1016): failed to read full dump, read 32 out of 1179644
[ 650.699581] mlx5_core 0041:01:00.1: mlx5_health_try_recover:335:(pid 1016): handling bad device here
[ 650.699594] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:243:(pid 1016): start
[ 651.375132] mlx5_core 0041:01:00.0: NIC IFC still 7 after 2000ms.
[ 651.375149] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:276:(pid 198): end
[ 651.375171] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:243:(pid 206): start
[ 652.735131] mlx5_core 0041:01:00.1: NIC IFC still 7 after 2000ms.
[ 652.735143] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:276:(pid 1016): end
[ 653.075350] mlx5_core 0041:01:00.1: mlx5_wait_for_pages:759:(pid 1016): Skipping wait for vf pages stage
[ 653.135142] mlx5_core 0041:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 653.395129] mlx5_core 0041:01:00.0: NIC IFC still 7 after 2000ms.
[ 653.395151] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:276:(pid 206): end
[ 653.765354] mlx5_core 0041:01:00.0: mlx5_wait_for_pages:759:(pid 206): Skipping wait for vf pages stage
[ 653.835135] mlx5_core 0041:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 716.685141] mlx5_core 0041:01:00.1: mlx5_health_try_recover:338:(pid 1016): health recovery flow aborted, PCI reads still not working
[ 717.245141] mlx5_core 0041:01:00.0: mlx5_health_try_recover:338:(pid 206): health recovery flow aborted, PCI reads still not working
[ 717.245175] mlx5_core 0041:01:00.0: mlx5_unload_one_devl_locked:1505:(pid 198): mlx5_unload_one_devl_locked: interface is down, NOP
[ 717.245290] mlx5_core 0041:01:00.0: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 0. Exit, result = 3, need reset
[ 717.245332] PCI 0041:01:00.0#10000: EEH: mlx5_core driver reports: 'need reset'
[ 717.245338] PCI 0041:01:00.1#10000: EEH: Invoking mlx5_core->error_detected(IO frozen)
[ 717.245352] mlx5_core 0041:01:00.1: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
[ 717.245380] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:243:(pid 198): start
[ 719.265142] mlx5_core 0041:01:00.1: NIC IFC still 7 after 2000ms.
[ 719.265160] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:276:(pid 198): end
[ 719.265170] mlx5_core 0041:01:00.1: mlx5_unload_one_devl_locked:1505:(pid 198): mlx5_unload_one_devl_locked: interface is down, NOP
[ 719.265350] mlx5_core 0041:01:00.1: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 0. Exit, result = 3, need reset
[ 719.265385] PCI 0041:01:00.1#10000: EEH: mlx5_core driver reports: 'need reset'
[ 719.265391] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 719.265526] EEH: Collect temporary log
[ 719.267450] EEH: of node=0041:01:00.0
[ 719.267543] EEH: PCI device/vendor: 101515b3
[ 719.267616] EEH: PCI cmd/status register: 00100140
[ 719.267625] EEH: PCI-E capabilities and status follow:
[ 719.267962] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[ 719.268232] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 719.268241] EEH: PCI-E 20: 00000000
[ 719.268245] EEH: PCI-E AER capability register set follows:
[ 719.268567] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[ 719.268827] EEH: PCI-E AER 10: 00002000 00002000 000001e4 00000000
[ 719.269087] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 719.269157] EEH: PCI-E AER 30: 00000000 00000000
[ 719.269165] EEH: of node=0041:01:00.1
[ 719.269234] EEH: PCI device/vendor: 101515b3
[ 719.269305] EEH: PCI cmd/status register: 00100142
[ 719.269312] EEH: PCI-E capabilities and status follow:
[ 719.269649] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[ 719.269913] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 719.269921] EEH: PCI-E 20: 00000000
[ 719.269926] EEH: PCI-E AER capability register set follows:
[ 719.270245] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[ 719.270507] EEH: PCI-E AER 10: 00002000 00002000 000001e4 00000000
[ 719.270765] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 719.270835] EEH: PCI-E AER 30: 00000000 00000000
[ 719.272823] EEH: Reset without hotplug activity
[ 721.637535] EEH: Beginning: 'slot_reset'
[ 721.637547] PCI 0041:01:00.0#10000: EEH: Invoking mlx5_core->slot_reset()
[ 721.637907] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 0. Enter
[ 721.638058] mlx5_core 0041:01:00.0: enabling device (0140 -> 0142)
[ 728.645144] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset:1856:(pid 198): mlx5_pci_slot_reset: wait vital failed with error code: -110
[ 728.645164] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 1. Exit, err = -110, result = 4, disconnect
[ 728.645177] PCI 0041:01:00.0#10000: EEH: mlx5_core driver reports: 'disconnect'
[ 728.645182] PCI 0041:01:00.1#10000: EEH: Invoking mlx5_core->slot_reset()
[ 728.645191] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 0. Enter
[ 728.645351] mlx5_core 0041:01:00.1: enabling device (0140 -> 0142)
[ 735.645147] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset:1856:(pid 198): mlx5_pci_slot_reset: wait vital failed with error code: -110
[ 735.645169] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 1. Exit, err = -110, result = 4, disconnect
[ 735.645182] PCI 0041:01:00.1#10000: EEH: mlx5_core driver reports: 'disconnect'
[ 735.645187] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[ 735.645199] EEH: Unable to recover from failure from PHB#41-PE#10000.
[ 735.645199] Please try reseating or replacing it
[ 735.647109] EEH: of node=0041:01:00.0
[ 735.647180] EEH: PCI device/vendor: 101515b3
[ 735.647251] EEH: PCI cmd/status register: 00100546
[ 735.647259] EEH: PCI-E capabilities and status follow:
[ 735.647592] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[ 735.647858] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 735.647866] EEH: PCI-E 20: 00000000
[ 735.647870] EEH: PCI-E AER capability register set follows:
[ 735.648195] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[ 735.648456] EEH: PCI-E AER 10: 00000000 00002000 000001e4 00000000
[ 735.648713] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 735.648783] EEH: PCI-E AER 30: 00000000 00000000
[ 735.648791] EEH: of node=0041:01:00.1
[ 735.648861] EEH: PCI device/vendor: 101515b3
[ 735.648930] EEH: PCI cmd/status register: 00100546
[ 735.648937] EEH: PCI-E capabilities and status follow:
[ 735.649267] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[ 735.649533] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 735.649540] EEH: PCI-E 20: 00000000
[ 735.649544] EEH: PCI-E AER capability register set follows:
[ 735.649864] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[ 735.650127] EEH: PCI-E AER 10: 00000000 00002000 000001e4 00000000
[ 735.650386] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 735.650456] EEH: PCI-E AER 30: 00000000 00000000
[ 735.652599] EEH: Beginning: 'error_detected(permanent failure)'
[ 735.652608] PCI 0041:01:00.0#10000: EEH: not actionable (1,1,1)
[ 735.652612] PCI 0041:01:00.1#10000: EEH: not actionable (1,1,1)
[ 735.652619] EEH: Finished:'error_detected(permanent failure)'
[ 735.725229] mlx5_core 0041:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
[ 735.725265] mlx5_core 0041:01:00.1: mlx5_wait_for_pages:759:(pid 198): Skipping wait for vf pages stage
[ 736.005168] mlx5_core 0041:01:00.1: mlx5_uninit_one:1424:(pid 198): mlx5_uninit_one: interface is down, NOP
[ 736.005425] mlx5_core 0041:01:00.1: E-Switch: cleanup
[ 736.165474] pci 0041:01:00.1: Removing from iommu group 0
[ 736.265197] mlx5_core 0041:01:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
[ 736.265214] mlx5_core 0041:01:00.0: mlx5_wait_for_pages:759:(pid 198): Skipping wait for vf pages stage
[ 736.515164] mlx5_core 0041:01:00.0: mlx5_uninit_one:1424:(pid 198): mlx5_uninit_one: interface is down, NOP
[ 736.515366] mlx5_core 0041:01:00.0: E-Switch: cleanup
[ 736.605455] pci 0041:01:00.0: Removing from iommu group 0
[ 736.605562] eeh_handle_normal_event: Cannot find PCI bus for PHB#41-PE#10000
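
For anyone skimming the log above, a minimal sketch of how the per-stage EEH outcome could be pulled out of a dmesg capture like this one. It is not part of any kernel tooling: the function names `unwrap_dmesg` and `eeh_stage_results` are made up for illustration, and the only format assumed is the `EEH: Finished:'<stage>' with aggregate recovery state:'<state>'` wording seen in this log. Lines without a leading `[ timestamp]` are treated as mail-wrapped continuations and re-joined first.

```python
import re

# A record starts with a dmesg timestamp such as "[ 719.265391]".
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]")

# Wording taken from the log above; the aggregate state is optional because
# the final 'permanent failure' stage prints none.
FINISHED = re.compile(
    r"EEH: Finished:'(?P<stage>[^']+)'"
    r"(?: with aggregate recovery state:'(?P<state>[^']+)')?"
)


def unwrap_dmesg(text):
    """Re-join lines that a mail client wrapped (hypothetical helper).

    Any non-empty line that does not begin with a timestamp is appended
    to the previous record, separated by a single space.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def eeh_stage_results(text):
    """Return (stage, aggregate_state) pairs; state is None when absent."""
    results = []
    for record in unwrap_dmesg(text):
        match = FINISHED.search(record)
        if match:
            results.append((match.group("stage"), match.group("state")))
    return results


# Three lines copied from the log above, still in their wrapped form.
sample = """\
[ 719.265391] EEH: Finished:'error_detected(IO frozen)' with aggregate
recovery state:'need reset'
[ 735.645187] EEH: Finished:'slot_reset' with aggregate recovery
state:'disconnect'
[ 735.652619] EEH: Finished:'error_detected(permanent failure)'
"""

for stage, state in eeh_stage_results(sample):
    print(stage, "->", state)
```

On the three sample lines this reports the same progression the log shows: 'need reset' after error_detected, 'disconnect' after slot_reset, and then the permanent-failure stage with no aggregate state.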