Message-ID: <b8f845d8-9642-2008-4d7d-819ed6e44d9f@linux.ibm.com>
Date:   Mon, 10 Oct 2022 11:59:07 +0530
From:   Ganesh <ganeshgr@...ux.ibm.com>
To:     saeedm@...dia.com, tariqt@...dia.com
Cc:     netdev@...r.kernel.org, oohall@...il.com
Subject: mlx cards failing to recover from EEH errors

Hi Saeed/Tariq,

On PPC, EEH error recovery has not been working on mlx cards since v5.17-rc1.

Any idea what has gone wrong?
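
For reference, a minimal reproduction sketch in Python (a sketch only, not what was used to capture the log below). It assumes the powerpc EEH error-injection debugfs file /sys/kernel/debug/powerpc/eeh_dev_break is available on the system under test; the device address, timeout and the "Recovery successful" match string are assumptions to adjust for the setup:

#!/usr/bin/env python3
# Sketch: freeze the PE behind one mlx5 function via the (assumed) powerpc
# EEH debugfs injection interface, then poll dmesg for the recovery outcome.
import subprocess
import time

DEV = "0041:01:00.0"   # one of the mlx5 functions from the log below
BREAK = "/sys/kernel/debug/powerpc/eeh_dev_break"   # assumed injection interface

def freeze(bdf: str) -> None:
    # Writing a domain:bus:dev.fn address asks EEH to isolate that PE.
    with open(BREAK, "w") as f:
        f.write(bdf)

def wait_outcome(timeout: float = 600.0) -> str:
    # Poll dmesg until the EEH core reports an outcome or we give up.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        if "EEH: Recovery successful" in log:              # assumed success message
            return "recovered"
        if "EEH: Unable to recover from failure" in log:   # failure message seen below
            return "failed"
        time.sleep(5)
    return "timed out"

if __name__ == "__main__":
    freeze(DEV)
    print(wait_outcome())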

Log:

[  649.335564] mlx5_core 0041:01:00.0: poll_health:817:(pid 0): Fatal error 1 detected
[  649.335588] mlx5_core 0041:01:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
[  649.336397] EEH: Recovering PHB#41-PE#10000
[  649.336410] EEH: PE location: N/A, PHB location: N/A
[  649.336416] EEH: Frozen PHB#41-PE#10000 detected
[  649.336422] EEH: Call Trace:
[  649.336426] EEH: [c000000000053b78] __eeh_send_failure_event+0x78/0x150
[  649.336597] EEH: [c00000000004ccf8] eeh_dev_check_failure+0x318/0x6b0
[  649.336607] EEH: [c00000000004d158] eeh_check_failure+0xc8/0x100
[  649.336615] EEH: [c0000000007e1454] ioread32be+0x114/0x180
[  649.336623] EEH: [c008000002eb4308] mlx5_health_check_fatal_sensors+0x30/0x190 [mlx5_core]
[  649.336684] EEH: [c008000002eb4c10] poll_health+0x58/0x270 [mlx5_core]
[  649.336740] EEH: [c000000000237098] call_timer_fn+0x58/0x1f0
[  649.336748] EEH: [c000000000238490] run_timer_softirq+0x360/0x7f0
[  649.336756] EEH: [c000000000e971f8] __do_softirq+0x158/0x3d0
[  649.336766] EEH: [c0000000000167c0] do_softirq_own_stack+0x40/0x60
[  649.336775] EEH: [c00000000014dd98] irq_exit+0x178/0x1c0
[  649.336782] EEH: [c0000000000273f4] timer_interrupt+0x194/0x410
[  649.336791] EEH: [c000000000009964] decrementer_common_virt+0x214/0x220
[  649.336805] EEH: [0000009721f89180] 0x9721f89180
[  649.336813] EEH: [c000000000b349bc] dedicated_cede_loop+0x9c/0x1b0
[  649.336825] EEH: [c000000000b31114] cpuidle_enter_state+0x364/0x570
[  649.336835] EEH: [c000000000b313c0] cpuidle_enter+0x50/0x70
[  649.336845] EEH: [c0000000001bac68] do_idle+0x3a8/0x420
[  649.336855] EEH: [c0000000001baf18] cpu_startup_entry+0x38/0x40
[  649.336869] EEH: [c0000000000125bc] rest_init+0xec/0xf0
[  649.336878] EEH: [c000000002004274] arch_post_acpi_subsys_init+0x0/0xc
[  649.336889] EEH: [c000000002004cc4] start_kernel+0xa14/0xa64
[  649.336900] EEH: [c00000000000d780] start_here_common+0x1c/0x9c
[  649.336909] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  649.336921] EEH: Notify device drivers to shutdown
[  649.336929] EEH: Beginning: 'error_detected(IO frozen)'
[  649.336937] PCI 0041:01:00.0#10000: EEH: Invoking mlx5_core->error_detected(IO frozen)
[  649.336943] mlx5_core 0041:01:00.0: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
[  649.336967] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:243:(pid 198): start
[  649.339914] mlx5_core 0041:01:00.0: mlx5_crdump_fill:34:(pid 206): failed to read full dump, read 32 out of 1179644
[  649.340586] mlx5_core 0041:01:00.0: mlx5_health_try_recover:335:(pid 206): handling bad device here
[  650.695123] mlx5_core 0041:01:00.1: poll_health:817:(pid 0): Fatal error 1 detected
[  650.695140] mlx5_core 0041:01:00.1: print_health_info:423:(pid 0): PCI slot is unavailable
[  650.698918] mlx5_core 0041:01:00.1: mlx5_crdump_fill:34:(pid 1016): failed to read full dump, read 32 out of 1179644
[  650.699581] mlx5_core 0041:01:00.1: mlx5_health_try_recover:335:(pid 1016): handling bad device here
[  650.699594] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:243:(pid 1016): start
[  651.375132] mlx5_core 0041:01:00.0: NIC IFC still 7 after 2000ms.
[  651.375149] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:276:(pid 198): end
[  651.375171] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:243:(pid 206): start
[  652.735131] mlx5_core 0041:01:00.1: NIC IFC still 7 after 2000ms.
[  652.735143] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:276:(pid 1016): end
[  653.075350] mlx5_core 0041:01:00.1: mlx5_wait_for_pages:759:(pid 1016): Skipping wait for vf pages stage
[  653.135142] mlx5_core 0041:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[  653.395129] mlx5_core 0041:01:00.0: NIC IFC still 7 after 2000ms.
[  653.395151] mlx5_core 0041:01:00.0: mlx5_error_sw_reset:276:(pid 206): end
[  653.765354] mlx5_core 0041:01:00.0: mlx5_wait_for_pages:759:(pid 206): Skipping wait for vf pages stage
[  653.835135] mlx5_core 0041:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[  716.685141] mlx5_core 0041:01:00.1: mlx5_health_try_recover:338:(pid 1016): health recovery flow aborted, PCI reads still not working
[  717.245141] mlx5_core 0041:01:00.0: mlx5_health_try_recover:338:(pid 206): health recovery flow aborted, PCI reads still not working
[  717.245175] mlx5_core 0041:01:00.0: mlx5_unload_one_devl_locked:1505:(pid 198): mlx5_unload_one_devl_locked: interface is down, NOP
[  717.245290] mlx5_core 0041:01:00.0: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 0. Exit, result = 3, need reset
[  717.245332] PCI 0041:01:00.0#10000: EEH: mlx5_core driver reports: 'need reset'
[  717.245338] PCI 0041:01:00.1#10000: EEH: Invoking mlx5_core->error_detected(IO frozen)
[  717.245352] mlx5_core 0041:01:00.1: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
[  717.245380] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:243:(pid 198): start
[  719.265142] mlx5_core 0041:01:00.1: NIC IFC still 7 after 2000ms.
[  719.265160] mlx5_core 0041:01:00.1: mlx5_error_sw_reset:276:(pid 198): end
[  719.265170] mlx5_core 0041:01:00.1: mlx5_unload_one_devl_locked:1505:(pid 198): mlx5_unload_one_devl_locked: interface is down, NOP
[  719.265350] mlx5_core 0041:01:00.1: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 0. Exit, result = 3, need reset
[  719.265385] PCI 0041:01:00.1#10000: EEH: mlx5_core driver reports: 'need reset'
[  719.265391] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[  719.265526] EEH: Collect temporary log
[  719.267450] EEH: of node=0041:01:00.0
[  719.267543] EEH: PCI device/vendor: 101515b3
[  719.267616] EEH: PCI cmd/status register: 00100140
[  719.267625] EEH: PCI-E capabilities and status follow:
[  719.267962] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[  719.268232] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[  719.268241] EEH: PCI-E 20: 00000000
[  719.268245] EEH: PCI-E AER capability register set follows:
[  719.268567] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[  719.268827] EEH: PCI-E AER 10: 00002000 00002000 000001e4 00000000
[  719.269087] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  719.269157] EEH: PCI-E AER 30: 00000000 00000000
[  719.269165] EEH: of node=0041:01:00.1
[  719.269234] EEH: PCI device/vendor: 101515b3
[  719.269305] EEH: PCI cmd/status register: 00100142
[  719.269312] EEH: PCI-E capabilities and status follow:
[  719.269649] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[  719.269913] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[  719.269921] EEH: PCI-E 20: 00000000
[  719.269926] EEH: PCI-E AER capability register set follows:
[  719.270245] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[  719.270507] EEH: PCI-E AER 10: 00002000 00002000 000001e4 00000000
[  719.270765] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  719.270835] EEH: PCI-E AER 30: 00000000 00000000
[  719.272823] EEH: Reset without hotplug activity
[  721.637535] EEH: Beginning: 'slot_reset'
[  721.637547] PCI 0041:01:00.0#10000: EEH: Invoking mlx5_core->slot_reset()
[  721.637907] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 0. Enter
[  721.638058] mlx5_core 0041:01:00.0: enabling device (0140 -> 0142)
[  728.645144] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset:1856:(pid 198): mlx5_pci_slot_reset: wait vital failed with error code: -110
[  728.645164] mlx5_core 0041:01:00.0: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 1. Exit, err = -110, result = 4, disconnect
[  728.645177] PCI 0041:01:00.0#10000: EEH: mlx5_core driver reports: 'disconnect'
[  728.645182] PCI 0041:01:00.1#10000: EEH: Invoking mlx5_core->slot_reset()
[  728.645191] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 0. Enter
[  728.645351] mlx5_core 0041:01:00.1: enabling device (0140 -> 0142)
[  735.645147] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset:1856:(pid 198): mlx5_pci_slot_reset: wait vital failed with error code: -110
[  735.645169] mlx5_core 0041:01:00.1: mlx5_pci_slot_reset Device state = 2 health sensors: 1 pci_status: 1. Exit, err = -110, result = 4, disconnect
[  735.645182] PCI 0041:01:00.1#10000: EEH: mlx5_core driver reports: 'disconnect'
[  735.645187] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[  735.645199] EEH: Unable to recover from failure from PHB#41-PE#10000.
[  735.645199] Please try reseating or replacing it
[  735.647109] EEH: of node=0041:01:00.0
[  735.647180] EEH: PCI device/vendor: 101515b3
[  735.647251] EEH: PCI cmd/status register: 00100546
[  735.647259] EEH: PCI-E capabilities and status follow:
[  735.647592] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[  735.647858] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[  735.647866] EEH: PCI-E 20: 00000000
[  735.647870] EEH: PCI-E AER capability register set follows:
[  735.648195] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[  735.648456] EEH: PCI-E AER 10: 00000000 00002000 000001e4 00000000
[  735.648713] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  735.648783] EEH: PCI-E AER 30: 00000000 00000000
[  735.648791] EEH: of node=0041:01:00.1
[  735.648861] EEH: PCI device/vendor: 101515b3
[  735.648930] EEH: PCI cmd/status register: 00100546
[  735.648937] EEH: PCI-E capabilities and status follow:
[  735.649267] EEH: PCI-E 00: 00024810 10008fe2 0009595f 00417883
[  735.649533] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[  735.649540] EEH: PCI-E 20: 00000000
[  735.649544] EEH: PCI-E AER capability register set follows:
[  735.649864] EEH: PCI-E AER 00: 15010001 00000000 00000000 00062010
[  735.650127] EEH: PCI-E AER 10: 00000000 00002000 000001e4 00000000
[  735.650386] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  735.650456] EEH: PCI-E AER 30: 00000000 00000000
[  735.652599] EEH: Beginning: 'error_detected(permanent failure)'
[  735.652608] PCI 0041:01:00.0#10000: EEH: not actionable (1,1,1)
[  735.652612] PCI 0041:01:00.1#10000: EEH: not actionable (1,1,1)
[  735.652619] EEH: Finished:'error_detected(permanent failure)'
[  735.725229] mlx5_core 0041:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
[  735.725265] mlx5_core 0041:01:00.1: mlx5_wait_for_pages:759:(pid 198): Skipping wait for vf pages stage
[  736.005168] mlx5_core 0041:01:00.1: mlx5_uninit_one:1424:(pid 198): mlx5_uninit_one: interface is down, NOP
[  736.005425] mlx5_core 0041:01:00.1: E-Switch: cleanup
[  736.165474] pci 0041:01:00.1: Removing from iommu group 0
[  736.265197] mlx5_core 0041:01:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
[  736.265214] mlx5_core 0041:01:00.0: mlx5_wait_for_pages:759:(pid 198): Skipping wait for vf pages stage
[  736.515164] mlx5_core 0041:01:00.0: mlx5_uninit_one:1424:(pid 198): mlx5_uninit_one: interface is down, NOP
[  736.515366] mlx5_core 0041:01:00.0: E-Switch: cleanup
[  736.605455] pci 0041:01:00.0: Removing from iommu group 0
[  736.605562] eeh_handle_normal_event: Cannot find PCI bus for PHB#41-PE#10000
