Date:   Tue, 1 Dec 2020 05:23:23 +0000
From:   George Cherian <gcherian@...vell.com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Sunil Kovvuri Goutham <sgoutham@...vell.com>,
        Linu Cherian <lcherian@...vell.com>,
        "Geethasowjanya Akula" <gakula@...vell.com>,
        "masahiroy@...nel.org" <masahiroy@...nel.org>,
        "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
        "saeed@...nel.org" <saeed@...nel.org>,
        "jiri@...nulli.us" <jiri@...nulli.us>
Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters
 for NPA



> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <kuba@...nel.org>
> Cc: 'netdev@...r.kernel.org' <netdev@...r.kernel.org>; 'linux-
> kernel@...r.kernel.org' <linux-kernel@...r.kernel.org>;
> 'davem@...emloft.net' <davem@...emloft.net>; Sunil Kovvuri Goutham
> <sgoutham@...vell.com>; Linu Cherian <lcherian@...vell.com>;
> Geethasowjanya Akula <gakula@...vell.com>; 'masahiroy@...nel.org'
> <masahiroy@...nel.org>; 'willemdebruijn.kernel@...il.com'
> <willemdebruijn.kernel@...il.com>; 'saeed@...nel.org'
> <saeed@...nel.org>; 'jiri@...nulli.us' <jiri@...nulli.us>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
> 
> Jakub,
> 
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <kuba@...nel.org>
> > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > davem@...emloft.net; Sunil Kovvuri Goutham
> <sgoutham@...vell.com>;
> > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > <gakula@...vell.com>; masahiroy@...nel.org;
> > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@...nel.org>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcherian@...vell.com>
> > > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > > davem@...emloft.net; Sunil Kovvuri Goutham
> > <sgoutham@...vell.com>;
> > > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > > <gakula@...vell.com>; masahiroy@...nel.org;
> > > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > >  - GENERAL events
> > > >  - ERROR events
> > > >  - RAS events
> > > >  - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > >  # devlink health
> > > >  pci/0002:01:00.0:
> > > >    reporter hw_npa
> > > >      state healthy error 0 recover 0
> > > >  # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > > >  NPA_AF_GENERAL:
> > > >         Unmap PF Error: 0
> > > >         NIX:
> > > >         0: free disabled RX: 0 free disabled TX: 0
> > > >         1: free disabled RX: 0 free disabled TX: 0
> > > >         Free Disabled for SSO: 0
> > > >         Free Disabled for TIM: 0
> > > >         Free Disabled for DPI: 0
> > > >         Free Disabled for AURA: 0
> > > >         Alloc Disabled for Resvd: 0
> > > >   NPA_AF_ERR:
> > > >         Memory Fault on NPA_AQ_INST_S read: 0
> > > >         Memory Fault on NPA_AQ_RES_S write: 0
> > > >         AQ Doorbell Error: 0
> > > >         Poisoned data on NPA_AQ_INST_S read: 0
> > > >         Poisoned data on NPA_AQ_RES_S write: 0
> > > >         Poisoned data on HW context read: 0
> > > >   NPA_AF_RVU:
> > > >         Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, this was tested by injecting errors through the devlink health test
> > interface.
> > The dump is generated automatically, and the counters do get out of
> > sync when errors occur continuously.
> > That is not much of an issue, as the user can manually trigger a
> > dump clear and re-dump the counters to get their exact status
> > at any point in time.
> 
> Now that the recover op has been added, the devlink error counter and recover
> counter will be correct. The internal counter for each event is needed only to
> show, within a specific reporter, how many times that event occurred.
> 
> The following log snippet is from the devlink health test run on the hw_nix
> reporter.
> # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> // Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and NIX_AF_GENERAL errors)
> # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
>   reporter hw_nix
>     state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
Oops, there was a log copy-paste error above: the count is not 250 (that was from a run in which
the test was done with 250 error injections). The output for the 33-injection run is:
# devlink health
pci/0002:01:00.0:
  reporter hw_npa
    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
  reporter hw_nix
    state healthy error 33 recover 33 last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 1
>         Memory Fault on NIX_AQ_RES_S write: 1
>         AQ Doorbell error: 1
>         Rx on unmapped PF_FUNC: 1
>         Rx multicast replication error: 1
>         Memory fault on NIX_RX_MCE_S read: 1
>         Memory fault on multicast WQE read: 1
>         Memory fault on mirror WQE read: 1
>         Memory fault on mirror pkt write: 1
>         Memory fault on multicast pkt write: 1
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 1
>         Poisoned data on NIX_AQ_RES_S write: 1
>         Poisoned data on HW context read: 1
>         Poisoned data on packet read from mirror buffer: 1
>         Poisoned data on packet read from mcast buffer: 1
>         Poisoned data on WQE read from mirror buffer: 1
>         Poisoned data on WQE read from multicast buffer: 1
>         Poisoned data on NIX_RX_MCE_S read: 1
>   NIX_AF_RVU:
>         Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 17
>         Memory Fault on NIX_AQ_RES_S write: 17
>         AQ Doorbell error: 17
>         Rx on unmapped PF_FUNC: 17
>         Rx multicast replication error: 17
>         Memory fault on NIX_RX_MCE_S read: 17
>         Memory fault on multicast WQE read: 17
>         Memory fault on mirror WQE read: 17
>         Memory fault on mirror pkt write: 17
>         Memory fault on multicast pkt write: 17
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 17
>         Poisoned data on NIX_AQ_RES_S write: 17
>         Poisoned data on HW context read: 17
>         Poisoned data on packet read from mirror buffer: 17
>         Poisoned data on packet read from mcast buffer: 17
>         Poisoned data on WQE read from mirror buffer: 17
>         Poisoned data on WQE read from multicast buffer: 17
>         Poisoned data on NIX_RX_MCE_S read: 17
>   NIX_AF_RVU:
>         Unmap Slot Error: 16
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump is something the user can turn off:
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
> The user can then dump whenever required, which will always return the correct
> counter values.
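
To illustrate the behaviour discussed above, here is a toy Python model of a reporter that keeps at most one stored dump while the driver-side counters keep running. It is only a sketch and not the kernel or driver code: the names ToyReporter and inject are hypothetical, each event group is collapsed to a single counter, and the assumption that the test injections alternate between NIX_AF_GENERAL/NIX_AF_RAS and NIX_AF_RVU is inferred from the 17/17/16 split in the log.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

GROUPS = ("NIX_AF_GENERAL", "NIX_AF_RAS", "NIX_AF_RVU")


@dataclass
class ToyReporter:
    """Toy model of one health reporter; hypothetical, not kernel or driver code."""
    auto_dump: bool = True
    errors: int = 0       # models the devlink "error" counter
    recoveries: int = 0   # models the devlink "recover" counter
    # Running per-group event counters kept in software (simplified to one per group).
    counters: Dict[str, int] = field(default_factory=lambda: {g: 0 for g in GROUPS})
    # At most one dump is stored at a time.
    stored_dump: Optional[Dict[str, int]] = None

    def report(self, *groups: str) -> None:
        """Record one reported error that touches the given event groups."""
        for g in groups:
            self.counters[g] += 1
        self.errors += 1
        # With auto_dump on, a snapshot is taken only if none is stored yet.
        if self.auto_dump and self.stored_dump is None:
            self.stored_dump = dict(self.counters)
        # Assumes auto_recover true and grace_period 0, as in the log.
        self.recoveries += 1

    def dump_show(self) -> Dict[str, int]:
        """Return the stored dump, taking a fresh snapshot if there is none."""
        if self.stored_dump is None:
            self.stored_dump = dict(self.counters)
        return dict(self.stored_dump)

    def dump_clear(self) -> None:
        """Drop the stored dump so that the next show takes a new snapshot."""
        self.stored_dump = None


def inject(reporter: ToyReporter, n: int) -> None:
    """Assumed test pattern: odd injections hit GENERAL and RAS, even ones hit RVU."""
    for i in range(1, n + 1):
        if i % 2:
            reporter.report("NIX_AF_GENERAL", "NIX_AF_RAS")
        else:
            reporter.report("NIX_AF_RVU")


if __name__ == "__main__":
    r = ToyReporter()
    inject(r, 33)
    print(r.errors, r.recoveries)  # devlink-level counters
    print(r.dump_show())           # stale snapshot taken at the first error
    r.dump_clear()
    print(r.dump_show())           # fresh snapshot with the current counters
```

In this model the first dump_show() returns the snapshot from the first injection, and only the show after dump_clear() reflects all 33 injections. Constructing the reporter with auto_dump=False skips the automatic snapshot, so the first manual show already returns the current counters.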
> 
> >
> > Regards,
> > -George
