linux-kernel - Re: iommu: flood of ahci 0000:e6:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0055 address=0xa14a4000 flags=0x0070]


Message-ID: <ce6d471a-a863-45ca-8a35-66dbbabe47c1@arm.com>
Date: Wed, 5 Feb 2025 18:53:23 +0000
From: Robin Murphy <robin.murphy@....com>
To: Corentin Labbe <clabbe.montjoie@...il.com>
Cc: joro@...tes.org, suravee.suthikulpanit@....com, will@...nel.org,
 iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
 Vasant Hegde <vasant.hegde@....com>
Subject: Re: iommu: flood of ahci 0000:e6:00.0: AMD-Vi: Event logged
 [IO_PAGE_FAULT domain=0x0055 address=0xa14a4000 flags=0x0070]

On 2025-02-05 1:36 pm, Corentin Labbe wrote:
> On Mon, Feb 03, 2025 at 01:01:45PM +0000, Robin Murphy wrote:
>> On 2025-02-03 9:05 am, Corentin Labbe wrote:
>>> Hello
>>>
>>> I have a Supermicro server that is flooded with this kernel message:
>>> ahci 0000:e6:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0055 address=0xa14a4000 flags=0x0070]
>>>
>>> The server works perfectly anyway.
>>> It happens with official ubuntu kernel vmlinuz-6.8.0-51-generic.
>>> I tried also a custom 6.12.6, same problem.
>>>
>>> I tried to update bios, no change.
>>> I tried iommu=soft, no change.
>>>
>>> I don't know what to do next.
>>>
>>> Regards
>>>
>>
>>> IOMMU group 83 e6:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller [1b4b:9230] (rev 11)
>>
>> Wow, a Marvell SATA controller doing something other than the usual
>> phantom function quirk, that's a nice change :D
>>
>> I'd guess that firmware has left it running for something like legacy
>> IDE emulation (if that's still a thing?) or its own soft-RAID driver,
>> but neglected to declare an IVMD entry to describe the reserved memory
>> region(s) it's using for that. A smoking gun would be if 0xa14a4000
>> matches some firmware-reserved PA in the system memory map. In that
>> case, if you're lucky you might have some firmware/BIOS option to
>> disable fancy behaviour and leave it in plain AHCI mode. Otherwise,
>> booting with "iommu.passthrough=1" (or the even bigger hammer of
>> "amd_iommu=off") should at least allow you to ignore the issue.
>>
> 
> Hello
> 
> Thanks for your help
> 
> There was no AHCI option in the BIOS (apart from hotplug enable).
> 
> Adding iommu.passthrough=1 made those messages go away.
> 
> Unfortunately, my example was not representative; the address is mostly random:
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | wc -l
> 9297
> 
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | head
>        2 address=0x1101f000
>        2 address=0x1101f004
>        3 address=0x1102f000
>        1 address=0x1102f004
>        2 address=0x1102f008
>        2 address=0x1102f010
>        2 address=0x11043000
>        2 address=0x11043004
>        1 address=0x11047000
>        1 address=0x11047004
> 
> dmesg |grep IO_PAGE_FAULT | grep -o 'address=0x[0-9a-f]*' | sort | uniq -c | tail
>        2 address=0xfffffffffe751004
>        2 address=0xfffffffffe7e6000
>        2 address=0xfffffffffe7e6004
>        4 address=0xfffffffffe823000
>        3 address=0xfffffffffe823004
>        2 address=0xfffffffffe830000
>        2 address=0xfffffffffe830004
>        3 address=0xfffffffffe833000
>        1 address=0xfffffffffe833004
>        1 address=0xfffffffffe833008

OK, these look like iommu-dma addresses, and the fact that they're up 
into the full 64-bit space implies that the 32-bit ones are most likely 
also kernel DMA burning through the whole 32-bit IOVA space rather than 
inadvertent physical addresses (and possibly the SATA driver is leaking 
DMA mappings as it keeps getting errors and retrying?). Indeed, it seems 
the firmware stuff probably was a red herring.
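
To make that reasoning a bit more concrete, here is a rough Python sketch 
that does the same job as the grep/sort/uniq pipeline quoted above and 
additionally splits the distinct fault addresses at the 4 GiB boundary. 
The function name and the shape of the returned dictionary are purely 
illustrative assumptions; the only thing taken from the report is the 
format of the AMD-Vi log line.

```python
import re
from collections import Counter

# Boundary between 32-bit and higher addresses.
FOUR_GIB = 1 << 32

# Matches the address field of lines such as:
#   ahci 0000:e6:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0055
#   address=0xa14a4000 flags=0x0070]
FAULT_RE = re.compile(r"IO_PAGE_FAULT\b.*?\baddress=0x([0-9a-fA-F]+)")


def summarize_fault_addresses(log_text):
    """Count IO_PAGE_FAULT addresses and split them at 4 GiB.

    Hypothetical helper for illustration only; it classifies addresses
    numerically and cannot tell an IOVA from a physical address by itself.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1

    return {
        "distinct": len(counts),
        "below_4g": sorted(a for a in counts if a < FOUR_GIB),
        "above_4g": sorted(a for a in counts if a >= FOUR_GIB),
        "counts": counts,
    }
```

It could be fed with something like 
summarize_fault_addresses(sys.stdin.read()) on the output of dmesg. A 
large number of distinct addresses spread across both buckets is at least 
consistent with allocator-assigned IOVAs rather than one fixed 
firmware-reserved region, though it does not prove it.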

I guess that then points to a question of whether it's maybe just the 
SATA driver going wonky and trying to make the device write to a 
DMA_TO_DEVICE mapping, or something going awry at the IOMMU to divert 
the device accesses to a different address space from the one iommu-dma 
believes it's using...

> But the domain/flags are always the same
> 
> Full dmesg (without IOMMU messages) https://uk01.z.antigena.com/l/VspdfbZQLwA2gZviRaGoPfE2bAxamMd9VFWOj4n78OuhpCoBo5HcXgWgXfTVvyxW1R3W9GTx4RbHm1MGyqBINkuTrnW31h9eTfLTUvXfcYh-IaTwmSc5kZo_-iU9-qQLbKsIjA9LNxyfbAA2AKGOSws6K4vuOrR6i-DL5DiQW1gHCrhhBMgE0Y7RK2m9
> 
> The server is doing qemu GPU passthrough via VFIO.
> I believe (i.e. I need to re-verify) that the messages start whether or not qemu is running.

Oh, it's certainly not impossible that getting VFIO involved may 
tickle some bug or misconfiguration wherein the wrong device ends up 
inadvertently attached to the wrong domain. I don't know the ins and 
outs of debugging with the AMD driver, though, so I think this is the 
point where I have to leave this one to Vasant :)

Thanks,
Robin.