netdev - Re: [PATCH RFC net-next 00/19] Devlink health reporting and recovery system

Message-ID: <44d02f4d-f5c9-a5dd-9392-97e1268f5371@mellanox.com>
Date:   Thu, 3 Jan 2019 13:31:59 +0000
From:   Eran Ben Elisha <eranbe@...lanox.com>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Jiri Pirko <jiri@...lanox.com>,
        Moshe Shemesh <moshe@...lanox.com>,
        Aya Levin <ayal@...lanox.com>, Tal Alon <talal@...lanox.com>,
        Ariel Almog <ariela@...lanox.com>
Subject: Re: [PATCH RFC net-next 00/19] Devlink health reporting and recovery
 system



On 1/3/2019 12:46 AM, Jakub Kicinski wrote:
> On Tue, 1 Jan 2019 09:58:30 +0000, Eran Ben Elisha wrote:
>> On 1/1/2019 3:47 AM, Jakub Kicinski wrote:
>>> The addition of "objdump" and its marshalling is a bit disappointing.
>>> It seemed to me when region snapshots were added that they would serve
>>> this exact purpose.  Taking a region snapshot or "core dump", if you
>>> will, after something went south.  Now it's confusing to have two
>>> mechanisms serving the same purpose.
>>
>> The motivation here was that the driver can provide reporters to its
>> sub-modules, such that each reporter will be able to provide all needed
>> info and recover methods to face run time errors.
>>
>> The implementation of the objdump function is in the hands of the
>> reporter developer, and he can dump whatever he thinks it is needed.
>> Keep in mind that a driver can have many reporters (TX, RX, FW, command
>> interface, etc). For most of the reporters, there is important
>> information in the driver which can not be fetched with region snapshot
>> (intended for memory fetching only).
>> For SW info, driver shall build the info and send it interpreted (unlike
>> all dumps / region available mechanisms)
>> I have in plans to extend the TX reporter to have objdump method in
>> which I will pass many SQ SW attributes that can be very handy in a
>> debug session.
> 
> My feeling is that instead of duplicating this infrastructure we should
> try to grow region snapshots beyond the "HW memory dumper".  In a
> debugging session you may want to have dumps as well as read the state
> live.  Region snapshot API was built with this promise in mind.  The
> information in reporter dump is not easily available other than when
> the dump happens, which is not great in a debugging session.  Driver
> will have to expose it via debugfs/region dump/ethtool dump or some
> such for live debug.

Architecture-wise, these are two different features which we shouldn't
mix. The region dump aims at dumping information for monitoring
"HW memory" in real time; it is more like a dumb channel that provides
memory chunks from HW to the user.

Health is a user-facing tool which provides a centralized view of a
device's health status, with diagnose, recover and dump options divided
among sub-reporters in real time, and it can save critical data
automatically when an error is reported.

Regarding the need for another tool for live debug: that is not the
case. The devlink health objdump command provides the latest stored
dump, and can also dump now for live debug.
It is even better, as the driver reporters already contain the needed
HW and SW buffers according to the device and the error. Region dump,
in contrast, requires the administrator to set up regions, while the
administrator is not an expert in the mapping between required info,
device model and reported error.
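
To make the intended semantics concrete, here is a minimal userspace
sketch in Python. It is a toy model, not the kernel API: the class name
HealthReporter, the dump_cb callback and the sq_cc field are all
hypothetical, and the rule "return the stored dump if there is one,
otherwise dump now" is my assumption about how the two cases combine.

```python
import time
from typing import Callable, Optional


class HealthReporter:
    """Toy model of one reporter (e.g. TX); not the kernel API."""

    def __init__(self, name: str, dump_cb: Callable[[], dict]):
        self.name = name
        self.dump_cb = dump_cb              # driver-supplied dump callback
        self.stored: Optional[dict] = None  # at most one stored dump
        self.stored_ts: Optional[float] = None

    def report_error(self) -> None:
        # On a reported error, save critical data automatically, but
        # keep the existing dump if one is already stored.
        if self.stored is None:
            self.stored = self.dump_cb()
            self.stored_ts = time.time()

    def dump(self) -> dict:
        # Assumed rule: stored dump first, otherwise dump now (live debug).
        if self.stored is not None:
            return self.stored
        return self.dump_cb()

    def clear(self) -> None:
        # Only clear makes the reporter ready to store a new dump.
        self.stored = None
        self.stored_ts = None
```

The administrator only names the reporter; which buffers get collected
is decided by the callback the driver registered.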

> 
>>> It's unclear from quick reading of the code how if the buffers get
>>> timestamped.  Can you report more than one?
>>
>> The timestamp which devlink health reports on, is the timestamp in which
>> it got the buffers filled by the driver. Every dump/diagnose has one ts.
>> Per reporter, it is possible to store up to one dump. only clear command
>> can remove it and makes the reporter ready to fetch a new objdump.
> 
> Region snapshots support collecting multiple snapshots IIRC, no?

It does allow multiple snapshots. This can easily be added to devlink
health if we wish to. I didn't see a current need for it.
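
As a rough idea of what such an extension could look like, here is a
small Python sketch of a per-reporter store that keeps several
timestamped dumps addressed by ID. Nothing here is in the patch set:
the name MultiDumpStore, the max_dumps limit and the policy of dropping
the oldest dump when the store is full are all assumptions made for
illustration.

```python
import time
from collections import deque


class MultiDumpStore:
    """Hypothetical bounded store of timestamped dumps for one reporter."""

    def __init__(self, max_dumps: int = 4):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._dumps = deque(maxlen=max_dumps)
        self._next_id = 0

    def add(self, data: dict) -> int:
        # Each dump gets exactly one timestamp, taken when it is stored.
        dump_id = self._next_id
        self._next_id += 1
        self._dumps.append((dump_id, time.time(), data))
        return dump_id

    def ids(self) -> list:
        return [entry[0] for entry in self._dumps]

    def get(self, dump_id: int):
        for stored_id, ts, data in self._dumps:
            if stored_id == dump_id:
                return ts, data
        raise KeyError(dump_id)
```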

> 
>>> About the marshalling, I'm not sure it belongs in the kernel.  There is
>>> precedent for adding interpretation of FW blobs in user space (ethtool).
>>> IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
>>> drivers.  Amount of code you add to print the simple example from last
>>> patch is not inspiring confidence.
>>
>> The idea was to provide the developer the ability to create "tree-like"
>> of information, it is needed when you want to describe complex objects.
>> This caused a longer coding, but I don't think we are even close to the
>> scale you are talking about.
>> We can remove the tree flexibility, and move to array format, it will
>> make the code of storing data by the driver to be shorter, but we will
>> lose the ability to have it in tree-like format.
> 
> To be clear I slightly oppose the marshalling in the first place.  It
> may be better to just dump the data as is, and have user space know
> what the interpretation is.

We provide a way to store the data in nested layers. In internal
discussions with Jiri, we decided that this is the correct approach.
However, if one insists, one can fill the buffers with raw binary and
label it as such.
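
To show the difference between the two options, here is a small Python
sketch of a dump that mixes interpreted, nested values with one raw
blob labeled as binary. The field names (sq, sqn, cc, pc, fw_trace) and
the {"type": "binary"} labeling convention are made up for this
example; they are not the format used by the patches.

```python
def flatten(node: dict, prefix: str = ""):
    """Yield (path, value) pairs from a nested dump tree.

    A dict labeled {"type": "binary", ...} is treated as an opaque leaf
    and handed to user space as is.
    """
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value.get("type") != "binary":
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical dump: interpreted SW state plus one labeled raw buffer.
dump = {
    "sq": {
        "sqn": 17,
        "state": {"cc": 100, "pc": 104},
    },
    "fw_trace": {"type": "binary", "data": bytes([0xDE, 0xAD])},
}

for path, value in flatten(dump):
    print(path, value)
```

The nested part can be printed by a generic tool with no knowledge of
the driver, while the binary leaf still needs a user space parser.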

> 
>> And again, regarding ethtool, it is a tool for dumping HW/FW, this could
>> have been an argument for the region snapshot, not for the devlink health...
> 
> I've seen drivers dumping ring state and other SW info via ethtool.
> All debugging APIs end up "mixed source" in my experience.
> 
>>> And on the bike shedding side :) -> I think you should steer clear of
>>> calling this objdump, as it has very little to do with the
>>> functionality of well-known objdump tool.  Its not even clear what the
>>> object the name is referring to is.
>> Let's agree on concept, we can change name to dump. Reporter->dump is
>> very clear when you know what the reporter is.
> 
> Thanks!
> 
>>> Long story short the overlap with region snapshots makes it unclear
>>> what purpose either serves, and IMHO we should avoid the marshalling by
>>> teaching user space how to interpret snapshots.  Preferably we only
>>> have one dump mechanism, and user space can be taught the interpretation
>>> once.
>> Forcing SW reporters to use region snapshot mechanism sounds like a bad
>> idea.
> 
> I'm not super excited about reusing region API for SW info.  But I like
> it more than having multitude of debug APIs for drivers to implement
> with largely overlapping functionality and semantics.

Dumping HW information is just a very small portion of devlink health.
If a driver developer thinks he can use the existing region API to
fetch some data into his reporter, he can do so in his dump method.

> 
>> To summarize, In order to have the health system working properly, it
>> must have a way to objdump/dump itself and provide it to the developer
>> for debug. Removing the objdump part will make it useless for run-time
>> debug.
>>
>> I think this is a powerful tool and we shall not ask the user level
>> scripts to fetch info from many other partial tools. It shall all be
>> focused in one place (recover, diagnose, objdump, statistics).
> 
> I don't think having reporter API refer to snapshot IDs is "many other
> partial tools" if that's the suggestion.  Seems like you want to focus
> all the reporter stuff in one place, and I want to focus the debug APIs
> a little :)
> 

As I see it, this tool is an envelope for all health functionality and
should provide a good and easy-to-use interface.
This tool should contain all health-related dumps (unfortunately, as
the subsystem has been around for a long time, it cannot be the sole
provider of them, but it should aspire to be the leading one).

With devlink health dump you can get:
1. Run time dumps
2. Automatic error-related dumps from the time the error happened,
stored in memory waiting to be fetched.
3. Driver-parsed data as well as raw data (up to the developer to
decide, the API can support both).
4. OOB configured data marked by the driver developers as relevant for 
the failure

None of the existing tools can provide such features. And adding them
to an existing tool doesn't seem possible / correct even in theory.

Removing the dump option from devlink for the sake of avoiding a
duplicate HW memory dump is a killer for this feature. And as a driver
developer I really think that the networking subsystem needs it.