linux-kernel - Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA


Message-ID: <20201201105933.7b22d119@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
Date:   Tue, 1 Dec 2020 10:59:33 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     George Cherian <gcherian@...vell.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Sunil Kovvuri Goutham <sgoutham@...vell.com>,
        Linu Cherian <lcherian@...vell.com>,
        "Geethasowjanya Akula" <gakula@...vell.com>,
        "masahiroy@...nel.org" <masahiroy@...nel.org>,
        "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
        "saeed@...nel.org" <saeed@...nel.org>,
        "jiri@...nulli.us" <jiri@...nulli.us>
Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
 reporters for NPA

On Tue, 1 Dec 2020 05:23:23 +0000 George Cherian wrote:
> > > > You seem to have missed the feedback Saeed and I gave you on v2.
> > > >
> > > > Did you test this with the errors actually triggering? Devlink
> > > > should store only  
> > > Yes, the same was tested using devlink health test interface by
> > > injecting errors.
> > > The dump gets generated automatically and the counters do get out of
> > > sync, in case of continuous error.
> > > That wouldn't be much of an issue as the user could manually trigger a
> > > dump clear and Re-dump the counters to get the exact status of the
> > > counters at any point of time.  
> > 
> > Now that recover op is added the devlink error counter and recover counter
> > will be proper. The internal counter for each event is needed just to
> > understand within a specific reporter, how many such events occurred.
> > 
> > Following is the log snippet of the devlink health test being done on hw_nix
> > reporter.
> > # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> > //Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and NIX_AF_GENERAL errors)
> > # devlink health
> > pci/0002:01:00.0:
> >   reporter hw_npa
> >     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
> >   reporter hw_nix
> >     state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
> Oops, there was a log copy-paste error above; it's not 250 (that was from a run in which the test was done
> for 250 error injections).
> # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
>   reporter hw_nix
>     state healthy error 33 recover 33

I thought it'd be better to just add each error as its own reporter
rather than combining them and abusing context for reporting detailed
stats.

This seems to be harder to get done than I thought. Maybe just go back
to the prints and we can move on.

> last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

Why the weird date? Is this something on your system?
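
For reference, the quoted timestamps can be read as offsets from the Unix
epoch. The sketch below only does that arithmetic; the helper name is made up
for illustration, and treating the values as UTC is an assumption, since the
thread does not say which timezone the tool printed them in. A result of a
couple of minutes would be consistent with a system clock that started at the
epoch at boot, but that is a guess the thread does not confirm.

```python
from datetime import datetime, timezone


def seconds_since_epoch(date_str: str, time_str: str) -> int:
    """Read a last_dump_date / last_dump_time pair as a UTC timestamp.

    Hypothetical helper: UTC is assumed here, the log does not state it.
    """
    stamp = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    return int(stamp.timestamp())


# Values copied from the two quoted hw_nix reporter lines.
print(seconds_since_epoch("1970-01-01", "00:02:16"))  # 136
print(seconds_since_epoch("1970-01-01", "00:04:16"))  # 256
```

So the 33-injection run shows a last dump 136 seconds after the epoch, and the
250-injection run shows 256 seconds.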