Message-ID: <20180913132453.GE11702@lunn.ch>
Date:   Thu, 13 Sep 2018 15:24:53 +0200
From:   Andrew Lunn <andrew@...n.ch>
To:     Eran Ben Elisha <eranbe@...lanox.com>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...lanox.com>,
        Andy Gospodarek <andrew.gospodarek@...adcom.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Simon Horman <simon.horman@...ronome.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        Tal Alon <talal@...lanox.com>,
        Ariel Almog <ariela@...lanox.com>
Subject: Re: [RFC PATCH iproute2-next] man: Add devlink health man page

On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
> >>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>            Sets TX_COMP_ERROR sensor parameters for a specific device.
> >
> >I hope the real sensors have more understandable names. If I remember
> >correctly, the same sort of comment was given for resource
> >management. It was pretty unclear what the resource names actually
> >mean. Is an average user going to have any idea how to actually use
> >these sensors and actions?
> 
> Well, hopefully. The whole point is to have it fully controlled by the user.
> However, names for the command should be short. I guess we shall have it
> documented (the challenge is to fit multiple vendors).
> 
> >
> >Can you give more examples of sensors. We should understand if there
> >are any overlaps with hwmon.
> 
> I restate here that we shall have SW sensors as well, and not only HW
> sensors.
> 
> This is what I had in mind:
> 1. command interface error
> 2. command interface timeout
> 3. stuck TX queue (like tx_timeout)
> 4. stuck TX completion queue (driver did not process packets in a reasonable
> time period)
> 5. stuck RX queue
> 6. RX completion error
> 7. TX completion error
> 8. HW / FW catastrophic error report
> 9. completion queue overrun

Hi Eran

I'm having trouble differentiating between these SW sensors and bugs
which need fixing. What causes a command interface error? Sending it a
command it does not understand? A wrongly formatted command? A command
the version of the firmware does not support? These all sound just
like plain old bugs which need fixing, not something which needs a
framework to detect them and try to recover from them by resetting
something.

I would have expected that all the issues are about physical
properties. Something similar to SMART for hard disks. The power
supplies are starting to droop, suggesting it might die soon. The
tacho on the fan suggests the fan is not rotating as fast as it
should, so the motor is going to die soon. An SFP is giving i2c
errors, suggesting it is not seated correctly. The card as a whole is
overheating, despite the fan working, suggesting the ambient
temperature is just too high.

	Andrew
