linux-kernel - Re: Question about: AMD-Vi: Event logged [IO_PAGE

Message-ID: <20150105164916.GB13975@8bytes.org>
Date:	Mon, 5 Jan 2015 17:49:17 +0100
From:	Joerg Roedel <joro@...tes.org>
To:	Raimonds Cicans <ray@...llo.lv>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Question about: AMD-Vi: Event logged [IO_PAGE_FAULT ...

Hello Raimonds,

On Mon, Jan 05, 2015 at 05:25:25PM +0200, Raimonds Cicans wrote:
> After a kernel upgrade (3.13 => 3.17) I started to receive the following
> string in my logs:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c
> address=0x0000000001355000 flags=0x0000]
> 
> I would like to understand this problem more deeply, so it
> would be nice if somebody could correct my assumptions and
> answer my questions.
> 
> 
> Assumptions:
> 
> 1) This message is generated by the AMD IOMMU subsystem
>      because PCIe device 08:00.0 tried to access a memory
>      region which was not mapped to any real memory
>      (lspci shows that this device is a DVB-S2 receiver card,
>       TBS 6981)
> 
> 2) Because the flags are 0, and because receivers in general
>     write to memory rather than read from it, it is a memory
>     write operation

Almost right, but flags are 0 for this fault which means it was a read
operation. The operation was to a page marked as non-present. This
caused the fault.
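
To make the reading of the flags value concrete, here is a minimal Python sketch. The bit positions below (I, PR, RW, TR) are assumptions based on the IO_PAGE_FAULT entry layout and should be checked against the AMD IOMMU specification; the function name describe_fault_flags is made up for this example.

```python
# Assumed bit positions within the 12-bit flags field printed by the kernel.
# Verify them against the IO_PAGE_FAULT entry in the AMD IOMMU specification.
FLAG_I = 0x008   # 1 = interrupt request, 0 = memory access
FLAG_PR = 0x010  # 1 = page was present, 0 = page not present
FLAG_RW = 0x020  # 1 = write, 0 = read
FLAG_TR = 0x100  # 1 = translation request, 0 = normal transaction


def describe_fault_flags(flags):
    """Return a short human-readable summary of an IO_PAGE_FAULT flags value."""
    return {
        "access": "write" if flags & FLAG_RW else "read",
        "page": "present" if flags & FLAG_PR else "not present",
        "kind": "interrupt" if flags & FLAG_I else "memory",
        "translation_request": bool(flags & FLAG_TR),
    }


# The value from the log message in this thread.
print(describe_fault_flags(0x0000))
```

With flags=0x0000 every bit is clear, which under these assumptions reads as a memory read to a non-present page.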

> 3) Possible causes:
>     a) memory region was never mapped
>     b) device accessed memory region before it was mapped
>     c) device accessed memory region after it was unmapped

I'd vote for option c). The address reported in the fault is a device
virtual address. The value looks like it was handed out from the
DMA-address allocator in the AMD IOMMU driver, which means the address
was once mapped for the device.

> 
> 4) Suspects:
>      a) kernel's DMA subsystem: very unlikely
>      b) kernel's IOMMU subsystem: very unlikely
>      c) AMD IOMMU driver: unlikely? - I had problems with the AMD IOMMU
>          itself in kernels 3.14 - 3.17 (AMD-Vi: Completion-Wait loop
>          timed out)
>          So maybe this problem is not fully fixed?

IO_PAGE_FAULTs are almost always a bug in the device driver for the
peripheral (or a bug in the firmware, but that is unlikely here).

But the "Completion-Wait loop timed out" message is also worrying. It
usually indicates broken firmware or broken hardware.

>      d) Receiver's driver: likely

Yes, my guess is that the driver for the receiver device calls
dma_unmap_$foo on a memory region it still uses for DMA. That call
makes the AMD IOMMU driver unmap the region, so any later DMA to it
fails with the message you see.
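
The sequence can be sketched with a toy model in Python. This is not how the kernel or the hardware is implemented; ToyIommu, IoPageFault and the physical address used are all hypothetical and only illustrate why an access after an unmap faults.

```python
PAGE_SIZE = 4096


class IoPageFault(Exception):
    """Raised by the toy model when a device touches an unmapped page."""


class ToyIommu:
    """Toy page table: device-virtual page number -> physical page number."""

    def __init__(self):
        self.page_table = {}

    def map_page(self, iova, phys):
        self.page_table[iova // PAGE_SIZE] = phys // PAGE_SIZE

    def unmap_page(self, iova):
        self.page_table.pop(iova // PAGE_SIZE, None)

    def device_access(self, iova):
        page = iova // PAGE_SIZE
        if page not in self.page_table:
            raise IoPageFault(f"IO_PAGE_FAULT address=0x{iova:016x}")
        return self.page_table[page] * PAGE_SIZE + iova % PAGE_SIZE


iommu = ToyIommu()
iova = 0x1355000                  # address from the log message
iommu.map_page(iova, 0x7F000000)  # arbitrary physical address for the demo
iommu.device_access(iova)         # mapped: translation succeeds
iommu.unmap_page(iova)            # driver unmaps while DMA is still in flight
try:
    iommu.device_access(iova)     # device keeps using the old address
except IoPageFault as exc:
    print(exc)
```

The point of the model is only the ordering: the address was valid once, the mapping went away, and the device kept using it.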

> Questions:
> 1) What does 'domain=0x001c' mean?

This is just an internal handle: the domain ID. The hardware reports it
in the fault structure, and it tells you which DMA domain the device is
attached to (and thus whether it has been attached to one at all).

> 2) Where can I find the definition of the possible flags?

In the AMD IOMMU specification, look for the IO_PAGE_FAULT reporting
structure. The flags reported in the kernel message are bits 16-27 of
the second 32-bit value.
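
As a rough sketch of that extraction, the following Python snippet splits a four-word event log entry into the fields the kernel prints. The shifts and masks are assumed to match the ones the Linux AMD IOMMU driver uses, and the sample entry is built by hand to match the message in this thread; it was not captured from hardware. The function name decode_event is hypothetical.

```python
def decode_event(event):
    """Split a 4 x 32-bit event log entry into the fields the kernel prints."""
    devid = event[0] & 0xFFFF             # bits 0-15 of the first word
    domid = event[1] & 0xFFFF             # bits 0-15 of the second word
    flags = (event[1] >> 16) & 0xFFF      # bits 16-27 of the second word
    ev_type = (event[1] >> 28) & 0xF      # bits 28-31 of the second word
    address = (event[3] << 32) | event[2]  # 64-bit device virtual address

    # PCI device ID: bus in bits 8-15, slot in bits 3-7, function in bits 0-2.
    bus, slot, func = devid >> 8, (devid >> 3) & 0x1F, devid & 0x7
    return {
        "type": ev_type,
        "device": f"{bus:02x}:{slot:02x}.{func:x}",
        "domain": domid,
        "flags": flags,
        "address": address,
    }


# Hand-built entry: type 0x2 (assumed IO_PAGE_FAULT code), domain 0x001c,
# flags 0, device 08:00.0, address 0x1355000.
sample = [0x0800, (0x2 << 28) | 0x001C, 0x01355000, 0x00000000]
fields = decode_event(sample)
print(
    f"device={fields['device']} domain=0x{fields['domain']:04x} "
    f"address=0x{fields['address']:016x} flags=0x{fields['flags']:04x}"
)
```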

> 3) What kind of address is written in the message?
>      - physical?
>      - virtual?
>      - address from the device's point of view?

It is a device virtual address, the address the device tried to access
but which was not mapped.

HTH,

	Joerg
