Message-ID: <20150105164916.GB13975@8bytes.org>
Date: Mon, 5 Jan 2015 17:49:17 +0100
From: Joerg Roedel <joro@...tes.org>
To: Raimonds Cicans <ray@...llo.lv>
Cc: linux-kernel@...r.kernel.org
Subject: Re: Question about: AMD-Vi: Event logged [IO_PAGE_FAULT ...
Hello Raimonds,
On Mon, Jan 05, 2015 at 05:25:25PM +0200, Raimonds Cicans wrote:
> After kernel upgrade (3.13 => 3.17) I started to receive following
> string in my logs:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c
> address=0x0000000001355000 flags=0x0000]
>
> I would like to deeper understand this problem, so it
> would be nice if some body can fix my assumptions and
> answer my questions.
>
>
> Assumptions:
>
> 1) This message is generated by AMD IOMMU subsystem
> because PCIe device 08:00.0 tried to access memory
> region which was not mapped to any real memory
> (lspci show that this device is DVB-S2 receiver card
> TBS 6981)
>
> 2) Because flags are 0 and because in general receivers
> write to memory not read from memory it is memory
> write operation
Almost right, but the flags are 0 for this fault, which means it was a
read operation. The access targeted a page marked as non-present, and
that caused the fault.
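To make the flags reading concrete, here is a minimal sketch in C. The
bit positions are an assumption on my side (they follow the values later
versions of the Linux driver use for the RW and PR bits inside the
12-bit flags field), and the FAULT_FLAG_* names and describe_fault() are
hypothetical. Please check them against the AMD IOMMU specification
before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: positions of two bits inside the 12-bit flags value that
 * the kernel prints. Verify against the AMD IOMMU specification. */
#define FAULT_FLAG_PR 0x010u /* 1 = entry was present (permission fault) */
#define FAULT_FLAG_RW 0x020u /* 1 = write access, 0 = read access */

static bool fault_is_write(uint16_t flags)
{
    return (flags & FAULT_FLAG_RW) != 0;
}

static bool fault_page_present(uint16_t flags)
{
    return (flags & FAULT_FLAG_PR) != 0;
}

/* Hypothetical helper: turn the two bits into a short description. */
static const char *describe_fault(uint16_t flags)
{
    if (fault_page_present(flags))
        return fault_is_write(flags) ? "write, permission violation"
                                     : "read, permission violation";
    return fault_is_write(flags) ? "write, page not present"
                                 : "read, page not present";
}
```

With flags=0x0000, as in the log line above, describe_fault() returns
"read, page not present".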
> 3) Possible causes:
> a) memory region was never mapped
> b) device accessed memory region before it was mapped
> c) device accessed memory region after it was unmapped
I'd vote for option c). The address reported in the fault is a device
virtual address. The value looks like it was handed out by the
DMA-address allocator in the AMD IOMMU driver, which means the address
was mapped for the device at some point.
>
> 4) Suspects:
> a) kernel's DMA subsystem: very unlikely
> b) kernel's IOMMU subsystem: very unlikely
> c) AMD IOMMU driver: unlikely? - i had problems with AMD IOMMU
> itself in kernels 3.14 - 3.17 (AMD-Vi: Completion-Wait loop
> timed out)
> So maybe this problem not fully fixed?
IO_PAGE_FAULTs are almost always a bug in the device driver for the
peripheral (or a bug in the firmware, but that is unlikely here).
But the "Completion-Wait loop timed out" message is also worrying. It
usually indicates broken firmware or broken hardware.
> d) Receiver's driver: likely
Yes, my guess is that the driver for the receiver device calls
dma_unmap_$foo on a memory region it still uses for DMA. That call
makes the AMD IOMMU driver unmap the region, so the next DMA access
from the device fails with the message you see.
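The suspected sequence can be sketched as a toy user-space model in C.
This is not the kernel DMA API; toy_iommu, toy_map(), toy_unmap() and
toy_device_access() are hypothetical names that only mimic "map, unmap
too early, device still accesses the address".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_MAX_PAGES  16

/* Toy stand-in for an IOMMU page table: a list of mapped pages. */
struct toy_iommu {
    uint64_t page[TOY_MAX_PAGES];    /* device virtual page numbers */
    bool     present[TOY_MAX_PAGES]; /* slot currently mapped? */
    unsigned faults;                 /* number of faults "logged" */
};

/* Map the page containing iova; returns false if the table is full. */
static bool toy_map(struct toy_iommu *mmu, uint64_t iova)
{
    for (size_t i = 0; i < TOY_MAX_PAGES; i++) {
        if (!mmu->present[i]) {
            mmu->page[i] = iova >> TOY_PAGE_SHIFT;
            mmu->present[i] = true;
            return true;
        }
    }
    return false;
}

/* Unmap the page containing iova (what the unmap call triggers). */
static void toy_unmap(struct toy_iommu *mmu, uint64_t iova)
{
    for (size_t i = 0; i < TOY_MAX_PAGES; i++)
        if (mmu->present[i] && mmu->page[i] == iova >> TOY_PAGE_SHIFT)
            mmu->present[i] = false;
}

/* Device access: succeeds only while the page is mapped. */
static bool toy_device_access(struct toy_iommu *mmu, uint64_t iova)
{
    for (size_t i = 0; i < TOY_MAX_PAGES; i++)
        if (mmu->present[i] && mmu->page[i] == iova >> TOY_PAGE_SHIFT)
            return true;
    mmu->faults++; /* models the IO_PAGE_FAULT event */
    return false;
}
```

In this model an access to 0x1355000 succeeds after toy_map() and
fails, with the fault counter incremented, once toy_unmap() has run for
the same address.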
> Questions:
> 1) What 'domain=0x001c' mean?
This is the domain ID, which is just an internal handle. The hardware
reports it in the fault structure, and it identifies the DMA domain the
device was attached to when the fault happened.
> 2) Where I can find definition of possible flags?
In the AMD IOMMU specification, look for the IO_PAGE_FAULT reporting
structure. The flags reported in the kernel message are bits 16-27 of
the second 32-bit value.
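As an illustration, the fields of the kernel message could be pulled
out of the four 32-bit words of an event entry roughly like this. Only
the position of the flags comes from the text above; the other field
positions are my assumption of what the Linux driver extracts, and the
EVT_* names, struct io_fault, decode_event() and format_fault() are
hypothetical. Treat it as a sketch and check the layout against the
specification.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMED layout of an event entry (four 32-bit words):
 * word 0, bits 0-15:  device ID (PCI bus/device/function)
 * word 1, bits 0-15:  domain ID
 * word 1, bits 16-27: flags
 * word 1, bits 28-31: event type
 * words 2 and 3:      low and high half of the faulting address */
#define EVT_ID_MASK     0xffffu
#define EVT_FLAGS_SHIFT 16
#define EVT_FLAGS_MASK  0xfffu
#define EVT_TYPE_SHIFT  28
#define EVT_TYPE_MASK   0xfu

struct io_fault {
    unsigned type, bus, dev, fn, domain, flags;
    uint64_t address;
};

static struct io_fault decode_event(const uint32_t ev[4])
{
    struct io_fault f;
    unsigned devid = ev[0] & EVT_ID_MASK;

    f.bus     = devid >> 8;          /* standard PCI BDF split */
    f.dev     = (devid >> 3) & 0x1f;
    f.fn      = devid & 0x7;
    f.domain  = ev[1] & EVT_ID_MASK;
    f.flags   = (ev[1] >> EVT_FLAGS_SHIFT) & EVT_FLAGS_MASK;
    f.type    = (ev[1] >> EVT_TYPE_SHIFT) & EVT_TYPE_MASK;
    f.address = ((uint64_t)ev[3] << 32) | ev[2];
    return f;
}

/* Format the fault like the kernel log line quoted in this thread. */
static int format_fault(char *buf, size_t len, const struct io_fault *f)
{
    return snprintf(buf, len,
                    "IO_PAGE_FAULT device=%02x:%02x.%x domain=0x%04x "
                    "address=0x%016" PRIx64 " flags=0x%04x",
                    f->bus, f->dev, f->fn, f->domain,
                    f->address, f->flags);
}
```

Under this assumed layout, the words { 0x00000800, 0x2000001c,
0x01355000, 0x00000000 } decode to the values from your log line. These
raw words are reconstructed by hand from the message, not taken from a
real event log.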
> 3) What kind of address is written in message?
> - physical?
> - virtual?
> - address from devices point of view?
It is a device virtual address, that is, the address as seen by the
device. The device tried to access it, but it was not mapped.
HTH,
Joerg