Message-ID: <6aec53bc-5b5c-73bc-5ad7-958995292d9e@mellanox.com>
Date: Thu, 27 Sep 2018 18:04:02 +0300
From: Eran Ben Elisha <eranbe@...lanox.com>
To: Jiri Pirko <jiri@...nulli.us>
Cc: netdev@...r.kernel.org,
Jakub Kicinski <jakub.kicinski@...ronome.com>,
Jiri Pirko <jiri@...lanox.com>,
Stephen Hemminger <stephen@...workplumber.org>,
Andrew Lunn <andrew@...n.ch>, "Tobin C. Harding" <me@...in.cc>,
Ariel Almog <ariela@...lanox.com>,
Tal Alon <talal@...lanox.com>
Subject: Re: [RFC PATCH iproute2-next V2] System specification exception API
On 9/27/2018 5:34 PM, Jiri Pirko wrote:
> Thu, Sep 27, 2018 at 04:02:48PM CEST, eranbe@...lanox.com wrote:
>>
>>
>> On 9/27/2018 3:47 PM, Jiri Pirko wrote:
>>> Wed, Sep 26, 2018 at 01:52:58PM CEST, eranbe@...lanox.com wrote:
>>>> The exception spec is targeted at real-time alerting, in order to know when
>>>> something bad has happened to a PCI device
>>>> - Provide alert debug information
>>>> - Self healing
>>>> - If problem needs vendor support, provide a way to gather all needed debugging
>>>> information.
>>>>
>>>> The exception mechanism contains condition checkers which sense for malfunction. Upon a condition hit,
>>>> actions such as logs and correction can be taken.
>>>>
>>>> The condition checkers are divided into the following groups
>>>> - Hardware - a checker which is triggered by the device due to
>>>> malfunction.
>>>> - Software - a checker which is triggered by the software due to
>>>> malfunction.
>>>
>>> What do you mean by a "software malfunction", a "FW malfunction"?
>>> Also, I don't see these 2 groups in the man page.
>>
>> Software malfunction can be a Transmit error (caused by bad send request).
>
> Sorry, but I still don't understand what "software malfunction" you are
> talking about. Could you be more specific please?
* The driver builds a bad send work request (a bug in the driver, a bug in
the packet generator, etc). When the driver posts it, it gets back an error
completion from the HW. This error might leave the HW queue in an
error state, so the queue cannot be used again until it is "recovered".
Condition: Error completion
Action: Queue recover
The entire scenario is due to SW malfunction.
* The driver tries to configure a HW QoS register, but the command is failed
by the FW.
Condition: Command execution error
Action: Dump of the command + dump of the related SW internal DB + dump of
the related FW DB
* Another existing example is the ndo_tx_timeout routine. (This is
done in the networking stack layer, and can be configured today via
sysfs). If a vendor driver has another specific checking routine like this
one (which needs to be configured from userspace), then it
can be handled via devlink-exception and be tagged as a software condition.
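
To make the condition/action pairing above a bit more concrete, here is a
minimal sketch in C of how the first two examples could be expressed as a
table. Every name in it (exc_group, exc_condition, exc_action, exc_rule,
exc_actions_for) is hypothetical and made up for this illustration; none of
it is taken from the patch, from devlink, or from any driver. The
ndo_tx_timeout case is left out because no action was named for it.

```c
#include <assert.h>

/* Hypothetical names for illustration only; not part of devlink or any driver. */

/* Condition groups from the RFC text. */
enum exc_group {
	EXC_GROUP_HW,	/* checker triggered by the device */
	EXC_GROUP_SW	/* checker triggered by the software */
};

/* The two software conditions described above. */
enum exc_condition {
	EXC_COND_ERROR_COMPLETION,	/* bad send work request completed in error */
	EXC_COND_CMD_EXEC_ERROR,	/* FW failed a configuration command */
	EXC_COND_MAX
};

/* Actions are bit flags so that one condition can trigger several of them. */
enum exc_action {
	EXC_ACT_QUEUE_RECOVER	= 1 << 0,
	EXC_ACT_DUMP_CMD	= 1 << 1,
	EXC_ACT_DUMP_SW_DB	= 1 << 2,
	EXC_ACT_DUMP_FW_DB	= 1 << 3
};

struct exc_rule {
	const char *name;	/* group is concatenated to the name, e.g. "sw_..." */
	enum exc_group group;
	unsigned int actions;	/* bitwise OR of enum exc_action values */
};

/* One rule per condition: which actions run when the condition is hit. */
static const struct exc_rule exc_rules[EXC_COND_MAX] = {
	[EXC_COND_ERROR_COMPLETION] = {
		"sw_error_completion", EXC_GROUP_SW,
		EXC_ACT_QUEUE_RECOVER
	},
	[EXC_COND_CMD_EXEC_ERROR] = {
		"sw_cmd_exec_error", EXC_GROUP_SW,
		EXC_ACT_DUMP_CMD | EXC_ACT_DUMP_SW_DB | EXC_ACT_DUMP_FW_DB
	},
};

/* Look up the action mask configured for a condition. */
static unsigned int exc_actions_for(enum exc_condition cond)
{
	assert((unsigned int)cond < EXC_COND_MAX);
	return exc_rules[cond].actions;
}
```

With such a table, a checker that detects an error completion would call
exc_actions_for(EXC_COND_ERROR_COMPLETION) and get back only the queue
recover bit, while a failed command would get back the three dump bits.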
>
>
>> FW/HW malfunction can be any catastrophic error report (the ones that should
>> be exposed to driver).
>> The comment here was to highlight that we can support different kinds of
>> condition groups.
>> If for a specific condition, we will need to highlight it is SW/HW, we can
>> concatenate it to its name.
>>
>> Eran
>>
>>>>