Message-ID: <e0f8f797-f7c5-efe6-0e40-8e5fb161a7ff@intel.com>
Date: Tue, 26 May 2020 14:00:30 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: Jiri Pirko <jiri@...nulli.us>, Jakub Kicinski <kuba@...nel.org>
Cc: Ido Schimmel <idosch@...sch.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
petrm@...lanox.com, amitc@...lanox.com
Subject: Re: devlink interface for asynchronous event/messages from firmware?
On 5/22/2020 4:00 AM, Jiri Pirko wrote:
> Thu, May 21, 2020 at 11:51:13PM CEST, kuba@...nel.org wrote:
>> On Thu, 21 May 2020 13:59:32 -0700 Jacob Keller wrote:
>>>>> So the ice firmware can optionally send diagnostic debug messages via
>>>>> its control queue. The current solutions we've used internally
>>>>> essentially hex-dump the binary contents to the kernel log, and then
>>>>> these get scraped and converted into a useful format for human consumption.
>>>>>
>>>>> I'm not 100% sure of the format, but I know it's based on a decoding file
>>>>> that is specific to a given firmware image, and thus attempting to tie
>>>>> this into the driver is problematic.
>>>>
>>>> You explained how it works, but not why it's needed :)
>>>
>>> Well, the reason we want it is to be able to read the debug/diagnostics
>>> data in order to debug issues that might be related to firmware or
>>> software mis-use of firmware interfaces.
>>>
>>> By having it be a separate interface rather than trying to scrape from
>>> the kernel message buffer, it becomes something we can have as a
>>> possibility for debugging in the field.
>>
>> For pure debug/tracing perhaps trace_devlink_hwerr() is the right fit?
>
> Well, trace_devlink_hwerr() is for simple errors that are mapped 1:1
> with some string. From what I got, Jacob needs to pass some data
> structures to the user. Something more similar to health reporter dumps
> and their fmsg.
>
Right. From my understanding the messages for debugging are not in a
format that can be immediately turned into a text string.

The reasoning behind this is that the set of messages changes
(especially during early firmware bringup), and thus sending actual ASCII
messages doesn't work well. It goes back to the "firmware is a black box"
model.

The problem is that in practice, we need ways to help debug this black
box, and this is one method that doesn't require hooking up a more
expensive device to intercept and debug with a step-through debugger. It
also enables capturing more verbose information about what the firmware
is doing.

But from how I understand it, the messages can't really be immediately
interpreted into a usable format by the kernel. I suppose in theory they
could be, but that would require the kernel to carry the full
translation table.
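
To make the translation-table point concrete, here is a rough userspace
sketch of what decoding might look like. Everything in it is an
assumption made for illustration: the header layout, the message IDs,
the template strings, and the names DECODE_TABLE and decode_message are
hypothetical and are not the actual ice firmware format.

```python
import struct

# Hypothetical per-firmware-image table: message ID -> text template.
# In practice this would ship alongside the firmware image, which is
# why carrying it in the driver is awkward.
DECODE_TABLE = {
    0x0001: "link state changed: port={0} up={1}",
    0x0002: "admin queue command 0x{0:04x} failed, status={1}",
}

# Assumed header: little-endian u16 message ID, u16 argument count,
# followed by that many u32 arguments.
HEADER = struct.Struct("<HH")


def decode_message(blob, table):
    """Turn one opaque binary message into text using the table."""
    msg_id, argc = HEADER.unpack_from(blob, 0)
    args = struct.unpack_from("<%dI" % argc, blob, HEADER.size)
    template = table.get(msg_id)
    if template is None:
        # Table does not match this firmware image: fall back to hex.
        return "unknown message 0x%04x: %s" % (msg_id, blob.hex())
    return template.format(*args)


raw = struct.pack("<HHII", 0x0002, 2, 0x0603, 5)
print(decode_message(raw, DECODE_TABLE))
```

Without the matching table the same bytes only come out as a hex dump,
which is roughly the situation the kernel is in today.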

Today, this is done by using a custom driver which logs the messages
directly to the kernel log buffer, which we know isn't the best solution.
Using a trace point is less bad, since it is disabled by default and
goes into tracefs instead of the default print buffer...

The pain is that we have to request loading a custom driver that enables
these prints, meaning that it is harder to obtain the data than if we
could just say "enable firmware logs, reproduce the issue, and grab this
data".

Thanks,
Jake