Message-ID: <184f9f87-00c2-e540-32a8-44053cd9f3c0@mellanox.com>
Date:   Tue, 25 Sep 2018 15:00:02 +0300
From:   Eran Ben Elisha <eranbe@...lanox.com>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...lanox.com>,
        Andy Gospodarek <andrew.gospodarek@...adcom.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Simon Horman <simon.horman@...ronome.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Tal Alon <talal@...lanox.com>,
        Ariel Almog <ariela@...lanox.com>
Subject: Re: [RFC PATCH iproute2-next] System specification health API



On 9/16/2018 1:37 PM, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 8:36 PM, Jakub Kicinski wrote:
>> On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
>>> The health spec is targeted for Real Time Alerting, in order to know
>>> when something bad has happened to a PCI device
>>
>> By spec you mean some standards body spec you implement or this
>> proposal is a spec?
> 
> This proposal is a spec
> 
>>
>>> - Provide alert debug information
>>> - Self healing
>>> - If problem needs vendor support, provide a way to gather all needed
>>>    debugging information.
>>>
>>> The health API contains sensors which detect malfunctions. Once a
>>> sensor is triggered, actions such as logging and correction can be taken.
>>> Sensors are sensing the health state and can trigger correction action.
>>>
>>> The sensors are divided into the following groups
>>> - Hardware sensor - a sensor which is triggered by the device due to
>>>    malfunction.
>>> - Software sensor - a sensor which is triggered by the software due to
>>>    malfunction.
>>> Both groups of sensors can be triggered due to an error event or due
>>> to a periodic check.
>>>
>>> Actions are the way to handle sensor events. Action can be in one of the
>>> following groups:
>>> - Dump -  SW trace, SW dump, HW trace, HW dump
>>> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of
>>>    device, etc)
>>> Actions can be performed by SW or HW.
>>>
>>> User is allowed to enable or disable sensors and sensor2action mapping.
>>>
>>> This RFC man page patch describes the suggested API of devlink-health
>>> in order to control sensors and actions.
>>
>> I like the idea of configuring response to events like this, although
>> I'm not sure the name sensor is appropriate here - perhaps exception or
>> error would be better?
> 
> I was trying to avoid a negative description. I called it sensor to
> avoid restricting the API to errors / exceptions only. I got the same
> type of comment from Andrew as well (devlink-health -> devlink-bug).
> 
> But if other vendors' driver developers don't see that it can be
> expanded to sensors which are not errors, then I guess we can refactor
> the names.
> 
>> Are there going to be values reported?
> 
> It depends on the sensor. If it has data that would help in debugging,
> then I assume yes, via the dumps.
> 
>>
>> I'm not so sure about HW sensors in relation to existing HWMON
>> infrastructure...  I assume you're targeting things like say some HW
>> engine/block reporting it encountered an error?  Sounds good, too.
> 
> yes, exactly.
> 
>>
>> Are the actions all envisioned to be performed by the driver?
>> Firmware?  Hardware?  I guess that distinction can be added later.
>> For FW/HW actions we would go back to the problem of persistence of
>> the setting since it was only implemented for params :S
> 
> The problem is not with the FW action; the problem is when you try to
> set a sensor2action mapping for the FW/HW. This will need a persistent
> configuration mode. Sensor2action in SW shall be in run-time mode (at
> least as a start).
> But it sounds like this needs some more tuning, to make it clear.

Revisiting this (before sending V2). My guideline is that persistence
inside the device is needed only when the information is required before
the driver loads. For any other configuration (i.e. post HW boot), one
can use standard Linux scripts to make the settings persistent.

If a new sensor is added that requires pre-HW-boot information, the API
can be extended later.
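
To make the guideline concrete, here is a rough Python sketch of the
model. It is not part of the patch and not the devlink API: every name in
it (Sensor, Action, ConfigMode, needed_before_driver_load, config_mode,
trigger) is hypothetical and only illustrates the idea that a
sensor2action mapping is stored in the device only when it is needed
before the driver loads, and is run-time otherwise.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The two action groups named in the RFC text.
    DUMP = "dump"
    RESET = "reset"


class ConfigMode(Enum):
    RUNTIME = "runtime"        # kept by the driver; scripts can re-apply it
    PERSISTENT = "persistent"  # stored inside the device


@dataclass
class Sensor:
    # Hypothetical model of a sensor; field names are assumptions.
    name: str
    source: str                              # "hw" or "sw"
    needed_before_driver_load: bool = False  # assumption: a per-sensor flag
    enabled: bool = True
    actions: set = field(default_factory=set)  # the sensor2action mapping


def config_mode(sensor: Sensor) -> ConfigMode:
    """Persist in the device only if the setting is needed before driver load."""
    if sensor.needed_before_driver_load:
        return ConfigMode.PERSISTENT
    return ConfigMode.RUNTIME


def trigger(sensor: Sensor) -> list:
    """Return the actions mapped to a sensor that fired, in a stable order."""
    if not sensor.enabled:
        return []  # a disabled sensor triggers nothing
    return sorted(sensor.actions, key=lambda a: a.value)
```

With this sketch, a SW sensor that maps to both dump and reset stays in
run-time mode, so its mapping would be re-applied by a script after boot
rather than stored in the device.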

> 
>>
>> Is the dump option going to tie back into region snapshots?
>>
> not necessarily, dumping SW objects as well can be helpful
