Message-ID: <884e0ae5-f9a2-734d-b46c-5b22678fbe6d@candelatech.com>
Date: Fri, 14 Apr 2017 09:41:04 -0700
From: Ben Greear <greearb@...delatech.com>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: "linux-wireless@...r.kernel.org" <linux-wireless@...r.kernel.org>,
netdev <netdev@...r.kernel.org>
Subject: Re: How to debug DMAR errors?
On 04/14/2017 09:24 AM, Alexander Duyck wrote:
> On Fri, Apr 14, 2017 at 9:19 AM, Ben Greear <greearb@...delatech.com> wrote:
>>
>>
>> On 04/14/2017 08:45 AM, Alexander Duyck wrote:
>>>
>>> On Thu, Apr 13, 2017 at 11:12 AM, Ben Greear <greearb@...delatech.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have been seeing a regular occurrence of DMAR errors, looking something
>>>> like this, when testing my ath10k driver/firmware under some specific
>>>> loads (maximum receive of 512-byte frames in AP mode):
>>>>
>>>> DMAR: DRHD: handling fault status reg 3
>>>> DMAR: [DMA Read] Request device [05:00.0] fault addr fd99f000 [fault reason 06] PTE Read access is not set
>>>> ath10k_pci 0000:05:00.0: firmware crashed! (uuid 594b1393-ae35-42b5-9dec-74ff0c6791ff)
>>>>
>>>> So, I am wondering if there is any way I can get more information about
>>>> what this fd99f000 address is?
>>>>
>>>> Once this problem hits, the entire OS locks hard (not even sysrq-boot will
>>>> do anything), so I guess I would need the DMAR logic to print out more
>>>> info on that address somehow.
>>>>
>>>> Thanks,
>>>> Ben
>>>
>>>
>>> There isn't much more info to give you. The problem is that the device
>>> at 05:00.0 attempted to read at fd99f000 even though it didn't have
>>> permission. In response, this should trigger a PCI Master Abort
>>> message to that function. It looks like the firmware for the device
>>> doesn't handle that, which is likely why things got hung.
>>>
>>> Really you would need to interrogate the ath10k_pci to see if there
>>> is/was a mapping somewhere for that address and what it was supposed
>>> to be used for.
>>
>>
>> I'm working on a hook in DMAR logic to call into ath10k_pci when the
>> error is seen, so the ath10k can dump debug info, including recent DMA
>> addresses.
>>
>> My code is an awful hack so far, but if someone could add a clean way to
>> register DMAR error callbacks, I think that would be very welcome. It could
>> tie into the automated DMA map/unmap debugging logic, and at the least,
>> someone could write custom debugging callbacks for the driver(s) in
>> question.
>>
>> Thanks,
>> Ben
>>
>
> You might look at coding up something to add pci_error_handlers for
> the pci_driver in the ath10k_pci driver. The PCI Master Abort should
> trigger an error that you could then capture in the driver, and your own
> implementation of the error handlers could at least dump it.
> If nothing else, there are probably some descriptor rings you could dump.
> My guess is that this is some sort of Tx issue, since the problem was a
> read fault, but I suppose there are other paths in the driver that might
> trigger DMA read requests.

This is a thick-firmware driver, so the firmware could also be screwing up
and accessing something it should not. The driver already has some
workarounds for sketchy firmware behaviour; maybe more are needed.

Anyway, once I added the debugging code, I didn't see it crash again, so it
might be a while before I know more.
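
A minimal sketch of the kind of driver-side debugging discussed in this thread, assuming nothing about ath10k internals: keep a small ring of recent DMA mappings so that, when a fault is reported, the driver can check whether the faulting address is or was mapped and what it was used for. All names here (dma_hist_map(), dma_hist_unmap(), dma_hist_lookup()) are hypothetical; in a real driver the hooks would sit next to its dma_map_single()/dma_unmap_single() calls and would need locking. The reported fault address (fd99f000) is page-aligned, so the lookup assumes 4 KiB granularity and matches any mapping that overlaps that page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-local history of recent DMA mappings (not an ath10k API). */
#define DMA_HIST_SIZE 64u         /* power of two, see dma_hist_slot() */
#define FAULT_PAGE_SIZE 4096ull   /* assumed fault-reporting granularity */

struct dma_hist_entry {
	uint64_t addr;    /* bus address returned by the mapping call */
	uint32_t len;     /* mapping length in bytes */
	const char *what; /* what the buffer is for, e.g. "rx-buf" */
	bool unmapped;    /* true once the driver has unmapped it */
	bool valid;       /* slot has been filled at least once */
};

static struct dma_hist_entry dma_hist[DMA_HIST_SIZE];
static unsigned int dma_hist_next; /* total mappings recorded so far */

/* Slot of the i-th most recent entry (i = 0 is the newest). Unsigned
 * wraparound is harmless because the ring size is a power of two. */
static struct dma_hist_entry *dma_hist_slot(unsigned int i)
{
	return &dma_hist[(dma_hist_next - 1u - i) % DMA_HIST_SIZE];
}

/* Record a new mapping, overwriting the oldest entry once the ring is full. */
static void dma_hist_map(uint64_t addr, uint32_t len, const char *what)
{
	struct dma_hist_entry *e = &dma_hist[dma_hist_next % DMA_HIST_SIZE];

	e->addr = addr;
	e->len = len;
	e->what = what;
	e->unmapped = false;
	e->valid = true;
	dma_hist_next++;
}

/* Mark the newest still-mapped entry at addr as unmapped; the entry is
 * kept so that a later fault can still be matched against it. */
static void dma_hist_unmap(uint64_t addr)
{
	for (unsigned int i = 0; i < DMA_HIST_SIZE; i++) {
		struct dma_hist_entry *e = dma_hist_slot(i);

		if (e->valid && !e->unmapped && e->addr == addr) {
			e->unmapped = true;
			return;
		}
	}
}

/* Return the newest entry overlapping the page containing fault_addr, or NULL. */
static const struct dma_hist_entry *dma_hist_lookup(uint64_t fault_addr)
{
	uint64_t page = fault_addr & ~(FAULT_PAGE_SIZE - 1);

	for (unsigned int i = 0; i < DMA_HIST_SIZE; i++) {
		const struct dma_hist_entry *e = dma_hist_slot(i);

		if (e->valid && e->addr < page + FAULT_PAGE_SIZE &&
		    page < e->addr + e->len)
			return e;
	}
	return NULL;
}
```

For instance, recording a 512-byte buffer at fd99f200, unmapping it, and then looking up fd99f000 returns that entry with unmapped set, which would suggest the device or firmware touched memory after the host released it. Those addresses are made up for illustration, not taken from the crash. The kernel's generic CONFIG_DMA_API_DEBUG option also tracks mappings; a driver-local ring like this one can additionally tag each buffer with its purpose.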
Thanks,
Ben
>
> - Alex
>
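
The DMAR error-callback registration proposed in this thread could be modeled roughly as follows. This is a user-space sketch of the idea, not an existing kernel interface: struct dmar_fault_info, dmar_register_fault_cb(), dmar_dispatch_fault() and example_driver_fault_cb() are hypothetical names, and the fields simply mirror what the DMAR log line reports (requester bus:dev.fn, fault address, reason code, read vs. write). The fault path would call dmar_dispatch_fault(), which hands the fault to whichever driver registered for that PCI function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fault record, mirroring the fields printed in the DMAR log. */
struct dmar_fault_info {
	uint8_t bus, dev, fn;   /* requester, e.g. 05:00.0 */
	uint64_t addr;          /* reported fault address */
	unsigned int reason;    /* fault reason code, e.g. 6 */
	bool is_read;           /* DMA Read vs. DMA Write */
};

typedef void (*dmar_fault_cb)(const struct dmar_fault_info *f, void *ctx);

#define MAX_FAULT_CBS 8

static struct {
	uint16_t bdf;           /* bus << 8 | dev << 3 | fn */
	dmar_fault_cb cb;
	void *ctx;              /* driver-private pointer passed back to cb */
} fault_cbs[MAX_FAULT_CBS];
static size_t n_fault_cbs;

/* Pack bus/device/function into the usual 16-bit PCI requester ID layout. */
static uint16_t make_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
	return (uint16_t)(bus << 8 | (dev & 0x1f) << 3 | (fn & 0x7));
}

/* Register cb for faults from one PCI function; false if the table is full. */
static bool dmar_register_fault_cb(uint8_t bus, uint8_t dev, uint8_t fn,
				   dmar_fault_cb cb, void *ctx)
{
	if (n_fault_cbs == MAX_FAULT_CBS)
		return false;
	fault_cbs[n_fault_cbs].bdf = make_bdf(bus, dev, fn);
	fault_cbs[n_fault_cbs].cb = cb;
	fault_cbs[n_fault_cbs].ctx = ctx;
	n_fault_cbs++;
	return true;
}

/* Invoke every callback registered for the faulting function and return
 * how many were called (0 means nobody asked about this device). */
static size_t dmar_dispatch_fault(const struct dmar_fault_info *f)
{
	uint16_t bdf = make_bdf(f->bus, f->dev, f->fn);
	size_t called = 0;

	for (size_t i = 0; i < n_fault_cbs; i++) {
		if (fault_cbs[i].bdf == bdf) {
			fault_cbs[i].cb(f, fault_cbs[i].ctx);
			called++;
		}
	}
	return called;
}

/* Example driver-side handler: copy the fault out so it can be dumped
 * later together with the driver's own state (e.g. recent DMA mappings). */
static void example_driver_fault_cb(const struct dmar_fault_info *f, void *ctx)
{
	struct dmar_fault_info *last = ctx;

	*last = *f;
}
```

Keying on bus/device/function keeps one driver's handler from seeing another device's faults. A real implementation would also need unregistration for module unload and locking, and since DMAR faults are reported from an interrupt handler, the callbacks would have to be safe in atomic context, for example only copying state out for later dumping, as example_driver_fault_cb() does.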
--
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc http://www.candelatech.com