netdev - Re: [RFC PATCH iproute2-next] man: Add devlink health man page


Message-ID: <66584ca2-8efa-9a6d-c1f3-1cf81cb04259@mellanox.com>
Date:   Thu, 13 Sep 2018 17:30:27 +0300
From:   Eran Ben Elisha <eranbe@...lanox.com>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...lanox.com>,
        Andy Gospodarek <andrew.gospodarek@...adcom.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Simon Horman <simon.horman@...ronome.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        Tal Alon <talal@...lanox.com>,
        Ariel Almog <ariela@...lanox.com>
Subject: Re: [RFC PATCH iproute2-next] man: Add devlink health man page



On 9/13/2018 4:24 PM, Andrew Lunn wrote:
> On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
>>
>>
>> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
>>>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
>>>
>>> I hope the real sensors have more understandable names. If I remember
>>> correctly, the same sort of comment was given for resource
>>> management. It was pretty unclear what the resource names actually
>>> mean. Is an average user going to have any idea how to actually use
>>> these sensors and actions?
>>
>> Well, hopefully. The whole point is to have it fully controlled by the user.
>> However, names for the command should be short. I guess we shall have it
>> documented (the challenge is to fit multiple vendors).
>>
>>>
>>> Can you give more examples of sensors. We should understand if there
>>> are any overlaps with hwmon.
>>
>> I restate here that we shall have SW sensors as well, and not only HW
>> sensors.
>>
>> This is what I had in mind:
>> 1. command interface error
>> 2. command interface timeout
>> 3. stuck TX queue (like tx_timeout)
>> 4. stuck TX completion queue (driver did not process packets in a reasonable
>> time period)
>> 5. stuck RX queue
>> 6. RX completion error
>> 7. TX completion error
>> 8. HW / FW catastrophic error report
>> 9. completion queue overrun
> 
> Hi Eran
> 
> I'm having trouble differentiating between these SW sensors and bugs
> which need fixing. What causes a command interface error? Sending it a
> command it does not understand? A wrongly formatted command? A command
> the version of the firmware does not support? These all sound just
> like plain old bugs which need fixing, not something which needs a
> framework to detect them and try to recover from them by resetting
> something.

Such issues do exist in production environments and need to be handled
even if the root cause is a bug that will be fixed in the latest release. My
feature should help developers / administrators control and recover
their live systems through auto-correction and logging support.
The goals are:
- Provide alert debug information
- Self-healing
- If a problem needs vendor support, provide a way to gather all the needed
debugging information.
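
To illustrate these goals, here is a minimal sketch in Python. It is not the
proposed devlink implementation: the class names (SensorPolicy,
HealthMonitor), the default policy, and the in-memory lists are all
hypothetical. Only the sensor name TX_COMP_ERROR and the reset / dump
actions come from the example command quoted earlier in the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SensorPolicy:
    """Hypothetical per-sensor settings, modeled on
    'action reset on|off action dump on|off'."""
    reset: bool = True  # assumed default: try to self-heal
    dump: bool = True   # assumed default: collect debug information


@dataclass
class HealthMonitor:
    """Illustrative model only; not the devlink or kernel API."""
    policies: Dict[str, SensorPolicy] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)     # alert information
    dumps: List[str] = field(default_factory=list)   # sensors that were dumped
    resets: List[str] = field(default_factory=list)  # sensors that triggered a reset

    def set_policy(self, sensor: str, reset: bool, dump: bool) -> None:
        """Store the user's choice of actions for one sensor."""
        self.policies[sensor] = SensorPolicy(reset=reset, dump=dump)

    def report(self, sensor: str, detail: str) -> None:
        """Handle one error event according to the sensor's policy."""
        policy = self.policies.get(sensor, SensorPolicy())
        # Goal 1: always record what happened.
        self.log.append(f"{sensor}: {detail}")
        # Goal 3: gather debugging information if the user enabled it.
        if policy.dump:
            self.dumps.append(sensor)
        # Goal 2: attempt recovery if the user enabled it.
        if policy.reset:
            self.resets.append(sensor)


if __name__ == "__main__":
    mon = HealthMonitor()
    # Same intent as: name TX_COMP_ERROR action reset off action dump on
    mon.set_policy("TX_COMP_ERROR", reset=False, dump=True)
    mon.report("TX_COMP_ERROR", "bad completion")
    print(mon.log, mon.dumps, mon.resets)
```

The point of the sketch is that logging is unconditional, while dump and
reset are per-sensor choices left to the administrator.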

> 
> I would have expected that all the issues are about physical
> properties. Something similar to SMART for hard disks. The power
> supplies are starting to droop, suggesting it might die soon. The
> tacho on the fan suggests the FAN is not rotating as fast as it
> should, so the motor is going to die soon. An SFP is giving i2c
> errors, suggesting it is not seated correctly. The card as a whole is
> overheating, despite the fan working, suggesting the ambient
> temperature is just too high.

AFAIU, the kind of sensors you suggest here require a manual fix /
physically approaching the setup: replacing HW, installing a new fan, etc.
Monitoring such events is easy; the driver can just log events from the HW to
dmesg and end its handling there.
None of these is a real networking issue that I would like to handle with
devlink-health.

Eran

> 
> 	Andrew
>