[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240821165034.1af97bad@fedora-3.home>
Date: Wed, 21 Aug 2024 16:50:34 +0200
From: Maxime Chevallier <maxime.chevallier@...tlin.com>
To: Thomas Gleixner <tglx@...utronix.de>, Andrew Lunn <andrew@...n.ch>,
Gregory Clement <gregory.clement@...tlin.com>, Sebastian Hesselbarth
<sebastian.hesselbarth@...il.com>, Russell King <linux@...linux.org.uk>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Thomas Petazzoni <thomas.petazzoni@...tlin.com>
Subject: Regression on Macchiatobin from the irqchip driver
Hi everyone,
I've been testing out some network series on the Macchiatobin (Armada
8k SoC) and I stumbled upon a crash at boot, that showed-up on the
latest net-next branch :
[ 2.755698] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[ 2.757592] mmcblk0: mmc0:0001 8GME4R 7.28 GiB
[ 2.766033] Mem abort info:
[ 2.766036] ESR = 0x0000000096000004
[ 2.774534] mmcblk0: p1
[ 2.777086] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2.779893] mmcblk0boot0: mmc0:0001 8GME4R 4.00 MiB
[ 2.784965] SET = 0, FnV = 0
[ 2.784969] EA = 0, S1PTW = 0
[ 2.784972] FSC = 0x04: level 0 translation fault
[ 2.784976] Data abort info:
[ 2.790648] mmcblk0boot1: mmc0:0001 8GME4R 4.00 MiB
[ 2.792943] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 2.796867] mmcblk0rpmb: mmc0:0001 8GME4R 512 KiB, chardev (234:0)
[ 2.801002] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 2.801006] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 2.830960] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000101a75000
[ 2.837436] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
[ 2.844265] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 2.850560] Modules linked in:
[ 2.853631] CPU: 2 UID: 0 PID: 51 Comm: kworker/u18:2 Not tainted 6.10.0-12649-g25010bfdf8bb #10
[ 2.862457] Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
[ 2.868926] Workqueue: events_unbound deferred_probe_work_func
[ 2.874800] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2.881794] pc : msi_lib_irq_domain_select+0x28/0x58
[ 2.886786] lr : irq_find_matching_fwspec+0xf0/0x120
[ 2.891778] sp : ffff8000871f39e0
[ 2.895107] x29: ffff8000871f39e0 x28: 0000000000000000 x27: ffff00013f7e11e8
[ 2.902281] x26: 0000000000000004 x25: ffff00013f7e11d0 x24: ffff800086764dd0
[ 2.909453] x23: ffff800081633b28 x22: ffff8000871f3a68 x21: 0000000000000001
[ 2.916624] x20: ffff800086764df0 x19: ffff000101171000 x18: ffffffffffffffff
[ 2.923797] x17: ffff000100de8a0c x16: ffff000100de8a00 x15: ffff000100da9cca
[ 2.930969] x14: ffffffffffffffff x13: 0030354072656c6c x12: 6f72746e6f632d74
[ 2.938141] x11: 7075727265746e69 x10: 0000000000036018 x9 : 0000000000000001
[ 2.945314] x8 : ffff8000871f3ab8 x7 : 0000000000000000 x6 : 4d0c1e0cade3e5ec
[ 2.952486] x5 : 6c65632d0c1e0c4d x4 : ffff00013f7e11e8 x3 : ffff000101171000
[ 2.959659] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
[ 2.966831] Call trace:
[ 2.969288] msi_lib_irq_domain_select+0x28/0x58
[ 2.973928] irq_find_matching_fwnode+0x4c/0x78
[ 2.978484] of_msi_get_domain+0x11c/0x138
[ 2.982602] mvebu_icu_subset_probe+0x5c/0x124
[ 2.987068] platform_probe+0x68/0xdc
[ 2.990748] really_probe+0xbc/0x2a4
[ 2.994343] __driver_probe_device+0x78/0x12c
[ 2.998722] driver_probe_device+0xdc/0x160
[ 3.002926] __device_attach_driver+0xb8/0x134
[ 3.007392] bus_for_each_drv+0x80/0xdc
[ 3.011248] __device_attach+0xa8/0x1b0
[ 3.015103] device_initial_probe+0x14/0x20
[ 3.019307] bus_probe_device+0xa8/0xac
[ 3.023162] deferred_probe_work_func+0x88/0xc0
[ 3.027714] process_one_work+0x150/0x294
[ 3.031743] worker_thread+0x2e4/0x3ec
[ 3.035510] kthread+0x118/0x11c
[ 3.038756] ret_from_fork+0x10/0x20
[ 3.042353] Code: d65f03c0 b9400820 35ffffa0 f9404461 (b9400823)
[ 3.048473] ---[ end trace 0000000000000000 ]---
I bisected the bug and the crash appeared at :
fbdf14e90ce4 ("irqchip/irq-mvebu-sei: Switch to MSI parent")
I've briefly looked at it, and it seems the NULL pointer that's being
dereferenced here is the "ops" pointer in msi_lib_irq_domain_select [1]
I'm not very familiar with the irqchip subsystem, my best guess
is that this is being called for the ap_domain, in the irq-mvebu-sei
driver, which doesn't have any msi_parent_ops set [2].
By looking at the msi_lib_irq_domain_select() implementation however, I
notice that it appears to be expected that these ops can be NULL by
looking at the check in the return line :
return ops && !!(ops->bus_select_mask & busmask);
However, the line above dereferences the ops pointer without prior
check :
/* Handle pure domain searches */
if (bus_token == ops->bus_select_token)
return 1;
As I said, this area of the kernel isn't very familiar to me, but I got
my board to boot with the following patch :
--- a/drivers/irqchip/irq-msi-lib.c
+++ b/drivers/irqchip/irq-msi-lib.c
@@ -128,6 +128,9 @@ int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
const struct msi_parent_ops *ops = d->msi_parent_ops;
u32 busmask = BIT(bus_token);
+ if (!ops)
+ return 0;
+
if (fwspec->fwnode != d->fwnode || fwspec->param_count != 0)
return 0;
@@ -135,6 +138,6 @@ int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
if (bus_token == ops->bus_select_token)
return 1;
- return ops && !!(ops->bus_select_mask & busmask);
+ return !!(ops->bus_select_mask & busmask);
----------------------------
I have zero confidence that this is the correct solution to the issue
so feel free to ditch that solution :) I'll gladly test any
patch for that on the MCBIN.
Let me know if you want me to run more tests.
Thanks,
Maxime
[1] : https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/irqchip/irq-msi-lib.c#L125
[2] : https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/irqchip/irq-mvebu-sei.c#L423
Powered by blists - more mailing lists