netdev - Re: [PATCH net-next 0/5][pull request] add v2 FW logging for ice driver

Message-ID: <bb0d1ef5-3045-919b-adb9-017c86c862ec@intel.com>
Date:   Tue, 14 Feb 2023 08:14:04 -0800
From:   Paul M Stillwell Jr <paul.m.stillwell.jr@...el.com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     Tony Nguyen <anthony.l.nguyen@...el.com>, <davem@...emloft.net>,
        <pabeni@...hat.com>, <edumazet@...gle.com>,
        <netdev@...r.kernel.org>, <jacob.e.keller@...el.com>,
        <jiri@...dia.com>, <idosch@...sch.org>
Subject: Re: [PATCH net-next 0/5][pull request] add v2 FW logging for ice
 driver

On 2/13/2023 4:40 PM, Jakub Kicinski wrote:
> On Mon, 13 Feb 2023 15:46:53 -0800 Paul M Stillwell Jr wrote:
>> On 2/10/2023 8:23 PM, Jakub Kicinski wrote:
>>> Can you describe how this is used a little bit?
>>> The FW log is captured at some level always (e.g. warns)
>>> or unless user enables _nothing_ will come out?
>>
>> My understanding is that the FW is constantly logging data into internal
>> buffers. When the user indicates what data they want and what level they
>> want then the data is filtered and output via either the UART or the
>> Admin queues. These patches retrieve the FW logs via the admin queue
>> commands.
> 
> What's the trigger to perform the collection?
> 
> If it's some error condition / assert in FW then maybe it's worth
> wrapping it up (or at least some portion of the functionality) into
> devlink health?

The trigger is the user asking to collect the FW logs. Nothing within
the FW triggers the logging. Generally there is some issue on the user
side, and we either suspect a problem in the FW or think the FW can
provide more info on what is going on, so we request FW logs. As an
example, users sometimes report issues with link flap and we request FW
logs to see what the FW link management code thinks is happening. In
this example there is no "error" per se, but the user is seeing some
undesired behavior and we are looking for more information on what
could be going on.
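
To make that model concrete, here is a toy sketch in Python (not driver
code, and not how the FW is actually implemented). The class, its method
names, the buffer size, and the meaning of the level value (lower = more
severe) are all assumptions for illustration; only the 0-4 level range
comes from the proposed fwlog_level parameter.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    level: int   # assumed semantics: lower value = more severe
    module: str  # hypothetical FW module name, e.g. "link"
    text: str


class FwLogModel:
    """Toy model: the FW always logs; output is filtered on user request."""

    def __init__(self, capacity=1024):
        # Internal FW buffer; the oldest entries fall off when it is full.
        self.buffer = deque(maxlen=capacity)
        self.enabled = False
        self.level = 0

    def fw_log(self, level, module, text):
        # The FW records events whether or not anyone is collecting.
        self.buffer.append(Entry(level, module, text))

    def configure(self, enabled, level):
        # The user (not a FW error) decides when and what to collect.
        self.enabled = enabled
        self.level = level

    def collect(self):
        # Nothing comes out unless the user has enabled logging.
        if not self.enabled:
            return []
        return [e for e in self.buffer if e.level <= self.level]
```

The point of the sketch is that collect() is driven by configure(), i.e.
by the user, and there is no error event anywhere in the path.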

> 
> AFAIU the purpose of devlink health is exactly to bubble up to the host
> asserts / errors / crashes in the FW, with associated "dump".
> 

Maybe it is, but when I look at devlink health it doesn't seem like it
is designed for something like this. Based on my reading of the
documentation, it responds to errors from the device; that's not really
what is happening in our case. The user is seeing some behavior that
they don't like and we are asking the FW to shed some light on what the
FW thinks is happening.

Link flap is an excellent example of this. The FW is doing what it
believes to be the correct thing, but some change on the link partner
that the FW doesn't handle correctly causes an issue. This is a classic
bug: the code thinks it's doing the correct thing and in reality it is
not.

In the above example nothing on the device is reporting an error so I 
don't see how the health reporter would get triggered.

Also, devlink health seems geared towards a model where the device has
an error, the error gets reported to the driver, the driver gets some
info to report to the user, and the driver moves on. FW logging is
different in that we generally want to see data across a long period of
time, because we can't always pinpoint when the thing we want to see
happened.

>> The output from the FW is a binary blob that a user would send back to
>> Intel to be decoded. This is only used for troubleshooting issues where
>> a user is working with someone from Intel on a specific problem.
> 
> I believe that's in line with devlink health. The devlink health log
> is "formatted" but I really doubt that any user can get far in debugging
> without vendor support.
> 

I agree, I just don't see what the trigger is in our case for FW logging.

>>> On Thu,  9 Feb 2023 11:06:57 -0800 Tony Nguyen wrote:
>>>> devlink dev param set <pci dev> name fwlog_enabled value <true/false> cmode runtime
>>>> devlink dev param set <pci dev> name fwlog_level value <0-4> cmode runtime
>>>> devlink dev param set <pci dev> name fwlog_resolution value <1-128> cmode runtime
>>>
>>> If you're using debugfs as a pipe you should put these enable knobs
>>> in there as well.
>>
>> My understanding is that debugfs use as a write mechanism is frowned on.
>> If that's not true and if we were to submit patches that used debugfs
>> instead of devlink and they would be accepted then I'll happily do that. :)
> 
> Frowned upon, but any vendor specific write API is frowned upon, I don't
> think the API is the matter of devlink vs debugfs. To put it differently -
> a lot of people try to use devlink params or debugfs without stopping
> to think about how the interface can be used and shared across vendors.
> Or even more sadly - how the end user will integrate them into their
> operations / fleet management.
> 
>>> Or add a proper devlink command to carry all this
>>> information via structured netlink (fw log + level + enable are hardly
>>> Intel specific).
>>
>> I don't know how other companies FW interface works so wouldn't assume
>> that I could come up with an interface that would work across all devices.
> 
> Let's think about devlink health first.

I'm happy to think about it, but as I said I don't see how our FW 
logging model fits into the paradigm of devlink health. I'm open to 
suggestions because I may not have thought about this in a way that 
would fit into devlink health.