Message-ID: <16234667-a9fd-4530-853f-ce594670f5dc@samsung.com>
Date: Tue, 1 Apr 2025 22:34:40 +0200
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Robin Murphy <robin.murphy@....com>, Lorenzo Pieralisi
<lpieralisi@...nel.org>, Hanjun Guo <guohanjun@...wei.com>, Sudeep Holla
<sudeep.holla@....com>, "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown
<lenb@...nel.org>, Russell King <linux@...linux.org.uk>, Greg Kroah-Hartman
<gregkh@...uxfoundation.org>, Danilo Krummrich <dakr@...nel.org>, Stuart
Yoder <stuyoder@...il.com>, Nipun Gupta <nipun.gupta@....com>, Nikhil
Agarwal <nikhil.agarwal@....com>, Joerg Roedel <joro@...tes.org>, Will
Deacon <will@...nel.org>, Rob Herring <robh@...nel.org>, Saravana Kannan
<saravanak@...gle.com>, Bjorn Helgaas <bhelgaas@...gle.com>
Cc: linux-acpi@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
devicetree@...r.kernel.org, linux-pci@...r.kernel.org, Charan Teja Kalla
<quic_charante@...cinc.com>
Subject: Re: [PATCH v2 4/4] iommu: Get DT/ACPI parsing into the proper probe
path
On 21.03.2025 17:48, Robin Murphy wrote:
> On 21/03/2025 12:15 pm, Marek Szyprowski wrote:
>> On 17.03.2025 19:22, Robin Murphy wrote:
>>> On 17/03/2025 7:37 am, Marek Szyprowski wrote:
>>>> On 13.03.2025 15:12, Robin Murphy wrote:
>>>>> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>>>>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>>>>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>>>>>> [...]
>>>>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my
>>>>>>>>> tests I found it breaks booting of the ARM64 RK3568-based Odroid-M1
>>>>>>>>> board (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is
>>>>>>>>> the relevant kernel log:
>>>>>>>>
>>>>>>>> ...and the bug-flushing-out begins!
>>>>>>>>
>>>>>>>>> Unable to handle kernel NULL pointer dereference at virtual address 00000000000003e8
>>>>>>>>> Mem abort info:
>>>>>>>>> ESR = 0x0000000096000004
>>>>>>>>> EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>>>>> SET = 0, FnV = 0
>>>>>>>>> EA = 0, S1PTW = 0
>>>>>>>>> FSC = 0x04: level 0 translation fault
>>>>>>>>> Data abort info:
>>>>>>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>>>>>> Modules linked in:
>>>>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>>>>>> ...
>>>>>>>>> Call trace:
>>>>>>>>> devm_kmalloc+0x2c/0x114 (P)
>>>>>>>>> rk_iommu_of_xlate+0x30/0x90
>>>>>>>>
>>>>>>>> Yeah, looks like this is doing something a bit questionable which
>>>>>>>> can't work properly. TBH the whole dma_dev thing could probably be
>>>>>>>> cleaned up now that we have proper instances, but for now does this
>>>>>>>> work?
>>>>>>>
>>>>>>> Yes, this patch fixes the problem I've observed.
>>>>>>>
>>>>>>> Reported-by: Marek Szyprowski <m.szyprowski@...sung.com>
>>>>>>> Tested-by: Marek Szyprowski <m.szyprowski@...sung.com>
>>>>>>>
>>>>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver
>>>>>>> and I doubt it can be cleaned up.
>>>>>>
>>>>>> On the contrary I suspect they both can - it all dates back to when
>>>>>> we had the single global platform bus iommu_ops and the SoC drivers
>>>>>> were forced to bodge their own notion of multiple instances, but with
>>>>>> the modern core code, ops are always called via a valid IOMMU
>>>>>> instance or domain, so in principle it should always be possible to
>>>>>> get at an appropriate IOMMU device now. IIRC it was mostly about
>>>>>> allocating and DMA-mapping the pagetables in domain_alloc, where the
>>>>>> private notion of instances didn't have enough information, but
>>>>>> domain_alloc_paging solves that.
>>>>>
>>>>> Bah, in fact I think I am going to have to do that now, since although
>>>>> it doesn't crash, rk_domain_alloc_paging() will also be failing for
>>>>> the same reason. Time to find a PSU for the RK3399 board, I guess...
>>>>>
>>>>> (Or maybe just move the dma_dev assignment earlier to match Exynos?)
>>>>
>>>> Well, I just found that Exynos IOMMU is also broken on some of my test
>>>> boards. It looks like the runtime PM links are somehow not correctly
>>>> established. I will try to analyze this later in the afternoon.
>>>
>>> Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable
>>> to boot my original 6.14-rc3-based branch - even with the IOMMU driver
>>> disabled, it's consistently dying somewhere near (or just after) init
>>> with what looks like some catastrophic memory corruption issue - very
>>> occasionally it's managed to print the first line of various different
>>> panics.
>>>
>>> Before that point though, with the IOMMU driver enabled it does appear
>>> to show signs of working OK:
>>>
>>> [ 0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3
>>> [ 0.654220] platform 14450000.mixer: Adding to iommu group 1
>>> ...
>>> [ 2.680920] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000
>>> ...
>>> [ 5.196674] exynos-mixer 14450000.mixer: exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable 0x42924000
>>> [ 5.207091] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000
>>>
>>>
>>> The multi-instance stuff in probe/release does look a bit suspect,
>>> however - seems like the second instance probe would overwrite the
>>> first instance's links, and then there would be a double-del() if the
>>> device were ever actually released again? I may have made that much
>>> more likely to happen, but I suspect it was already possible with
>>> async driver probe...
>>
>> That is really strange. My Odroid XU3 boots fine from commit
>> bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"),
>> although the IOMMU seems not to be working correctly. I've tested this
>> with the 14450000.mixer device (one needs to attach an HDMI cable to get
>> it activated) and it looks like the video data is not being read from
>> memory at all (the lack of VSYNC is reported, no IOMMU fault). However,
>> from time to time, everything initializes and works properly.
>
> Urgh, seems my mistake was assuming exynos_defconfig was the right
> thing to begin from - bcb81ac6ae3c with that still dies in the same
> way (this time I saw a hint of spin_bug() being hit...), however a
> multi_v7_defconfig build does get to userspace OK again with no
> obvious signs of distress:
>
> [root@...rm ~]# grep -Hr . /sys/kernel/iommu_groups/*/type
> /sys/kernel/iommu_groups/0/type:identity
> /sys/kernel/iommu_groups/1/type:identity
> /sys/kernel/iommu_groups/10/type:identity
> /sys/kernel/iommu_groups/2/type:identity
> /sys/kernel/iommu_groups/3/type:identity
> /sys/kernel/iommu_groups/4/type:identity
> /sys/kernel/iommu_groups/5/type:identity
> /sys/kernel/iommu_groups/6/type:identity
> /sys/kernel/iommu_groups/7/type:identity
> /sys/kernel/iommu_groups/8/type:identity
> /sys/kernel/iommu_groups/9/type:identity
>
> Annoyingly I do have an adapter for the fiddly micro-HDMI, but it's at
> home :(
>
>> It looks like this is somehow related to the different IOMMU/DMA-mapping
>> glue code, as the other boards (ARM64 based) with exactly the same
>> Exynos IOMMU driver always work fine. I've tried to figure out what
>> actually happens, but so far I haven't found anything conclusive.
>> Disabling the call to dev->bus->dma_configure(dev) from
>> iommu_init_device() seems to fix this, but that is almost equivalent to
>> reverting the $subject patch. I don't get why calling it in
>> iommu_init_device() causes problems. It also doesn't look like this is
>> in any way related to the multi-instance stuff, as the same happens if I
>> leave only a single exynos-sysmmu instance and its client (only the
>> 14450000.mixer device in the system).
>
> On a hunch I stuck a print in exynos_iommu_probe_device(), and it
> looks like in fact device_link_add() isn't getting called at all, and
> indeed your symptoms do sound like they could be explained by the
> IOMMU not being reliably resumed... lemme stare at
> exynos_iommu_of_xlate() a bit longer...
Just to let everyone know: the $subject change is okay. This is a bug in
the exynos-iommu driver, fixed by the following patch:
https://lore.kernel.org/all/20250401202731.2810474-1-m.szyprowski@samsung.com/
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland