[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <f7c43842-270e-48f8-ba89-9b5e67910131@yahoo.com>
Date: Mon, 4 Nov 2024 05:34:42 -0600
From: Dullfire <dullfire@...oo.com>
To: davem@...emloft.net, sparclinux@...r.kernel.org, netdev@...r.kernel.org,
linux-pci@...r.kernel.org
Subject: Kernel panic with niu module
Hello,
I am working on a set of patches that address a panic on bind in the niu
module. However, none of the approaches I see integrate well with the kernels
frameworks, so any feed back you could provide would be appreciated.
On sparcv9 systems (and possibly others), when the niu drivers sets up the
MSIX IRQ vectors, a fatal trap[0] is encountered. I have done a number of
tests[1]. From these tests I have believe that any read from a specific MSIX
table entry must come after a write to it's PCI_MSIX_ENTRY_DATA field,
otherwise it will cause a fatal trap.
I see types of approaches:
1) Add writes to the ENTRY_DATA field to niu before it call into the
msi(x) code.
2) Adjust the MSIX code to either skip the read, or write to ENTRY_DATA first
3) Add a PCI quirk for this device to "initialize" the MSIX vector table.
Approach 1 encounters issues in needing to write to the MSIX table. The
functions needed to do this are internal to msi.c (or drivers/pci/msi/msi.h),
so they would have to either be reproduces in niu, or exposed in a public
header. Neither of those seem like a good approach to me.
Approach 2 can be done in a small amount of code, but it would either require
the addition of a struct pci_dev flag of some sort, or it would be invasive
to lots of other devices.
While approach 3 seems to be the most correct location, it suffers many of
the same issues as approach 1.
I have also bisected the kernel, and determined that upstream commit
7d5ec3d3612396dc6d4b76366d20ab9fc06f399f revealed this issue. This commit
adds read to the mask status before any write to PCI_MSIX_ENTRY_DATA, thus
provoking the issue.
If you have any suggestions, please let me know.
Regards,
Jonathan Currier
[0] The trap looks like this:
-----------------------------------------------------------------------------
[ 25.166817] niu: niu.c:v1.1 (Apr 22, 2010)
[ 25.166952] niu 0001:04:00.0: enabling device (0144 -> 0146)
[ 25.174100] niu: niu0: Found PHY 002063b0 type MII at phy_port 26
[ 25.174559] niu: niu0: Found PHY 002063b0 type MII at phy_port 27
[ 25.175004] niu: niu0: Found PHY 002063b0 type MII at phy_port 28
[ 25.175449] niu: niu0: Found PHY 002063b0 type MII at phy_port 29
[ 25.176298] niu: niu0: Port 0 [4 RX chans] [6 TX chans]
[ 25.176405] niu: niu0: Port 1 [4 RX chans] [6 TX chans]
[ 25.176507] niu: niu0: Port 2 [4 RX chans] [6 TX chans]
[ 25.176548] niu: niu0: Port 3 [4 RX chans] [6 TX chans]
[ 25.176590] niu: niu0: Port 0 RDC tbl(0) [ 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 ]
[ 25.176757] niu: niu0: Port 0 RDC tbl(1) [ 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 ]
[ 25.176890] niu: niu0: Port 1 RDC tbl(2) [ 4 5 6 7 4 5 6 7 4 5 6 7 4 5 6 7 ]
[ 25.177053] niu: niu0: Port 1 RDC tbl(3) [ 4 5 6 7 4 5 6 7 4 5 6 7 4 5 6 7 ]
[ 25.177185] niu: niu0: Port 2 RDC tbl(4) [ 8 9 10 11 8 9 10 11 8 9 10 11 8 9 10 11 ]
[ 25.177349] niu: niu0: Port 2 RDC tbl(5) [ 8 9 10 11 8 9 10 11 8 9 10 11 8 9 10 11 ]
[ 25.177483] niu: niu0: Port 3 RDC tbl(6) [ 12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15 ]
[ 25.177649] niu: niu0: Port 3 RDC tbl(7) [ 12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15 ]
[ 25.245863] NON-RESUMABLE ERROR: Reporting on cpu 64
[ 25.245973] NON-RESUMABLE ERROR: TPC [0x00000000005f6900] <msix_prepare_msi_desc+0x90/0xa0>
[ 25.246106] NON-RESUMABLE ERROR: RAW [4010000000000016:00000e37f93e32ff:0000000202000080:ffffffffffffffff
[ 25.246215] NON-RESUMABLE ERROR: 0000000800000000:0000000000000000:0000000000000000:0000000000000000]
[ 25.246291] NON-RESUMABLE ERROR: handle [0x4010000000000016] stick [0x00000e37f93e32ff]
[ 25.246335] NON-RESUMABLE ERROR: type [precise nonresumable]
[ 25.246373] NON-RESUMABLE ERROR: attrs [0x02000080] < ASI sp-faulted priv >
[ 25.246435] NON-RESUMABLE ERROR: raddr [0xffffffffffffffff]
[ 25.246476] NON-RESUMABLE ERROR: insn effective address [0x000000c50020000c]
[ 25.246517] NON-RESUMABLE ERROR: size [0x8]
[ 25.246544] NON-RESUMABLE ERROR: asi [0x00]
[ 25.246573] CPU: 64 UID: 0 PID: 745 Comm: kworker/64:1 Not tainted 6.11.5 #63
[ 25.246625] Workqueue: events work_for_cpu_fn
[ 25.246671] TSTATE: 0000000011001602 TPC: 00000000005f6900 TNPC: 00000000005f6904 Y: 00000000 Not tainted
[ 25.246729] TPC: <msix_prepare_msi_desc+0x90/0xa0>
[ 25.246771] g0: 00000000000002e9 g1: 000000000000000c g2: 000000c50020000c g3: 0000000000000100
[ 25.246815] g4: ffff8000470307c0 g5: ffff800fec5be000 g6: ffff800047a08000 g7: 0000000000000000
[ 25.246861] o0: ffff800014feb000 o1: ffff800047a0b620 o2: 0000000000000011 o3: ffff800047a0b620
[ 25.246906] o4: 0000000000000080 o5: 0000000000000011 sp: ffff800047a0ad51 ret_pc: 00000000005f7128
[ 25.246951] RPC: <__pci_enable_msix_range+0x3cc/0x460>
[ 25.247004] l0: 000000000000000d l1: 000000000000c01f l2: ffff800014feb0a8 l3: 0000000000000020
[ 25.247049] l4: 000000000000c000 l5: 0000000000000001 l6: 0000000020000000 l7: ffff800047a0b734
[ 25.247094] i0: ffff800014feb000 i1: ffff800047a0b730 i2: 0000000000000001 i3: 000000000000000d
[ 25.247138] i4: 0000000000000000 i5: 0000000000000000 i6: ffff800047a0ae81 i7: 00000000101888b0
[ 25.247182] I7: <niu_try_msix.constprop.0+0xc0/0x130 [niu]>
[ 25.247321] Call Trace:
[ 25.247346] [<00000000101888b0>] niu_try_msix.constprop.0+0xc0/0x130 [niu]
[ 25.247442] [<000000001018f840>] niu_get_invariants+0x183c/0x207c [niu]
[ 25.247536] [<00000000101902fc>] niu_pci_init_one+0x27c/0x2fc [niu]
[ 25.247630] [<00000000005ef3e4>] local_pci_probe+0x28/0x74
[ 25.247677] [<0000000000469240>] work_for_cpu_fn+0x8/0x1c
[ 25.247726] [<000000000046b008>] process_scheduled_works+0x144/0x210
[ 25.247782] [<000000000046b518>] worker_thread+0x13c/0x1c0
[ 25.247833] [<00000000004710e0>] kthread+0xb8/0xc8
[ 25.247874] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[ 25.247931] [<0000000000000000>] 0x0
[ 25.247961] Kernel panic - not syncing: Non-resumable error.
-----------------------------------------------------------------------------
[1] Tests I have done (and their results)
All tests done on a T5240 (UltaSPARC T2).
In my test cases: niu driver tries to use up to 13 vectors (table size of 32 entries).
"SUCCCESS" - Test case booted with functional networking
"FAILED" - Test case experienced a fatal trap after loading the niu module
- writing 0 to all of the MSIX table entrys' ENTRY_VECTOR_CTRL: FAILED
- writing 0 to all of the MSIX table entrys' ENTRY_LOWER_ADDR: FAILED
- writing 0 to all of the MSIX table entrys' ENTRY_UPPER_ADDR: FAILED
- writing 0 to all of the MSIX table entrys' ENTRY_DATA: SUCCESS
- writing 0 to only one MSIX table entry's ENTRY_DATA: FAILED
- writing 0 to only the first 1/2 of the MSIX table entrys' ENTRY_DATA: SUCCESS
- writing ~0 to only the first 1/2 of the MSIX table entrys' ENTRY_DATA: SUCCESS
- writing 0 to only the first 12 of the MSIX table entrys' ENTRY_DATA: FAILED
- writing 0 to only the first 13 of the MSIX table entrys' ENTRY_DATA: SUCCESS
- reading ENTRY_DATA before writing it: FAILED
- reading ENTRY_LOWER_ADDR before writing it: FAILED
- reading ENTRY_UPPER_ADDR before writing it: FAILED
Powered by blists - more mailing lists