Message-ID: <ce6d471a-a863-45ca-8a35-66dbbabe47c1@arm.com>
Date: Wed, 5 Feb 2025 18:53:23 +0000
From: Robin Murphy <robin.murphy@....com>
To: Corentin Labbe <clabbe.montjoie@...il.com>
Cc: joro@...tes.org, suravee.suthikulpanit@....com, will@...nel.org,
iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
Vasant Hegde <vasant.hegde@....com>
Subject: Re: iommu: flood of ahci 0000:e6:00.0: AMD-Vi: Event logged
[IO_PAGE_FAULT domain=0x0055 address=0xa14a4000 flags=0x0070]

On 2025-02-05 1:36 pm, Corentin Labbe wrote:
> On Mon, Feb 03, 2025 at 01:01:45PM +0000, Robin Murphy wrote:
>> On 2025-02-03 9:05 am, Corentin Labbe wrote:
>>> Hello
>>>
>>> I have a Supermicro server which is flooded with kernel messages:
>>> ahci 0000:e6:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0055 address=0xa14a4000 flags=0x0070]
>>>
>>> The server works perfectly anyway.
>>> It happens with official ubuntu kernel vmlinuz-6.8.0-51-generic.
>>> I tried also a custom 6.12.6, same problem.
>>>
>>> I tried to update bios, no change.
>>> I tried iommu=soft, no change.
>>>
>>> I don't know what to do next.
>>>
>>> Regards
>>>
>>
>>> IOMMU group 83 e6:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller [1b4b:9230] (rev 11)
>>
>> Wow, a Marvell SATA controller doing something other than the usual
>> phantom function quirk, that's a nice change :D
>>
>> I'd guess that firmware has left it running for something like legacy
>> IDE emulation (if that's still a thing?) or its own soft-RAID driver,
>> but neglected to declare an IVMD entry to describe the reserved memory
>> region(s) it's using for that. A smoking gun would be if 0xa14a4000
>> matches some firmware-reserved PA in the system memory map. In that
>> case, if you're lucky you might have some firmware/BIOS option to
>> disable fancy behaviour and leave it in plain AHCI mode. Otherwise,
>> booting with "iommu.passthrough=1" (or the even bigger hammer of
>> "amd_iommu=off") should at least allow you to ignore the issue.
>>
>
> Hello
>
> Thanks for your help
>
> There was no AHCI option in the BIOS (apart from hotplug enable).
>
> Adding iommu.passthrough=1 led to the absence of those messages.
>
> Unfortunately, my example is not correct, the address is mostly random:
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | wc -l
> 9297
>
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | head
> 2 address=0x1101f000
> 2 address=0x1101f004
> 3 address=0x1102f000
> 1 address=0x1102f004
> 2 address=0x1102f008
> 2 address=0x1102f010
> 2 address=0x11043000
> 2 address=0x11043004
> 1 address=0x11047000
> 1 address=0x11047004
>
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | tail
> 2 address=0xfffffffffe751004
> 2 address=0xfffffffffe7e6000
> 2 address=0xfffffffffe7e6004
> 4 address=0xfffffffffe823000
> 3 address=0xfffffffffe823004
> 2 address=0xfffffffffe830000
> 2 address=0xfffffffffe830004
> 3 address=0xfffffffffe833000
> 1 address=0xfffffffffe833004
> 1 address=0xfffffffffe833008

OK, these look like iommu-dma addresses, and the fact that they're up
into the full 64-bit space implies that the 32-bit ones are most likely
also kernel DMA burning through the whole 32-bit IOVA space rather than
inadvertent physical addresses (and possibly the SATA driver is leaking
DMA mappings as it keeps getting errors and retrying?). Indeed it seems
the firmware stuff probably was a red herring.
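
To make the split between 32-bit and full 64-bit fault addresses easier to
see than with `sort | uniq -c`, something like the following could be run
over the dmesg output. This is only a rough sketch: the function name
`summarize_faults`, the 4 KiB page size and the returned dictionary layout
are assumptions made for illustration, not anything taken from the kernel
or the AMD driver.

```python
import re
from collections import Counter

# Matches the address field of an "AMD-Vi: Event logged [IO_PAGE_FAULT ...]" line.
FAULT_RE = re.compile(r"IO_PAGE_FAULT\b.*?\baddress=0x([0-9a-fA-F]+)")

FOUR_GIB = 1 << 32   # boundary between 32-bit and higher addresses
PAGE_SHIFT = 12      # assumes 4 KiB pages


def summarize_faults(lines):
    """Count IO_PAGE_FAULT addresses and split them at the 4 GiB boundary."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1

    return {
        "events": sum(counts.values()),                    # total fault lines
        "unique": len(counts),                             # distinct addresses
        "pages": len({a >> PAGE_SHIFT for a in counts}),   # distinct 4 KiB pages
        "below_4g": sorted(a for a in counts if a < FOUR_GIB),
        "above_4g": sorted(a for a in counts if a >= FOUR_GIB),
    }
```

Calling `summarize_faults(sys.stdin)` with dmesg piped in would give the
counts directly. A large number of distinct pages on both sides of the
4 GiB boundary would fit the "kernel DMA burning through IOVA space"
reading, whereas a handful of fixed addresses would point back towards a
firmware-reserved region.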

I guess that then points to a question of whether it's maybe just the
SATA driver going wonky and trying to make the device write to a
DMA_TO_DEVICE mapping, or something going awry at the IOMMU to divert
the device accesses to a different address space from the one iommu-dma
believes it's using...

> But the domain/flags are always the same
>
> Full dmesg (without IOMMU messages) https://uk01.z.antigena.com/l/VspdfbZQLwA2gZviRaGoPfE2bAxamMd9VFWOj4n78OuhpCoBo5HcXgWgXfTVvyxW1R3W9GTx4RbHm1MGyqBINkuTrnW31h9eTfLTUvXfcYh-IaTwmSc5kZo_-iU9-qQLbKsIjA9LNxyfbAA2AKGOSws6K4vuOrR6i-DL5DiQW1gHCrhhBMgE0Y7RK2m9
>
> The server is doing qemu GPU passthrough via VFIO.
> I believe (aka I need to re-verify) that the messages start whether qemu starts or not.

Oh, it's certainly not impossible that getting VFIO involved may
tickle some bug or misconfiguration wherein the wrong device ends up
inadvertently attached to the wrong domain. I don't know the ins and
outs of debugging with the AMD driver, though, so I think this is the
point where I have to leave this one to Vasant :)

Thanks,
Robin.