Message-ID: <BYAPR18MB2679F855ADD7176A1587ED37C5F40@BYAPR18MB2679.namprd18.prod.outlook.com>
Date: Tue, 1 Dec 2020 05:23:23 +0000
From: George Cherian <gcherian@...vell.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
Sunil Kovvuri Goutham <sgoutham@...vell.com>,
Linu Cherian <lcherian@...vell.com>,
"Geethasowjanya Akula" <gakula@...vell.com>,
"masahiroy@...nel.org" <masahiroy@...nel.org>,
"willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
"saeed@...nel.org" <saeed@...nel.org>,
"jiri@...nulli.us" <jiri@...nulli.us>
Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters
for NPA
> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <kuba@...nel.org>
> Cc: 'netdev@...r.kernel.org' <netdev@...r.kernel.org>; 'linux-
> kernel@...r.kernel.org' <linux-kernel@...r.kernel.org>;
> 'davem@...emloft.net' <davem@...emloft.net>; Sunil Kovvuri Goutham
> <sgoutham@...vell.com>; Linu Cherian <lcherian@...vell.com>;
> Geethasowjanya Akula <gakula@...vell.com>; 'masahiroy@...nel.org'
> <masahiroy@...nel.org>; 'willemdebruijn.kernel@...il.com'
> <willemdebruijn.kernel@...il.com>; 'saeed@...nel.org'
> <saeed@...nel.org>; 'jiri@...nulli.us' <jiri@...nulli.us>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
>
> Jakub,
>
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <kuba@...nel.org>
> > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > davem@...emloft.net; Sunil Kovvuri Goutham
> <sgoutham@...vell.com>;
> > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > <gakula@...vell.com>; masahiroy@...nel.org;
> > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@...nel.org>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcherian@...vell.com>
> > > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > > davem@...emloft.net; Sunil Kovvuri Goutham
> > <sgoutham@...vell.com>;
> > > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > > <gakula@...vell.com>; masahiroy@...nel.org;
> > > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > > - GENERAL events
> > > > - ERROR events
> > > > - RAS events
> > > > - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > > # devlink health
> > > > pci/0002:01:00.0:
> > > > reporter hw_npa
> > > > state healthy error 0 recover 0
> > > > # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > > > NPA_AF_GENERAL:
> > > > Unmap PF Error: 0
> > > > NIX:
> > > > 0: free disabled RX: 0 free disabled TX: 0
> > > > 1: free disabled RX: 0 free disabled TX: 0
> > > > Free Disabled for SSO: 0
> > > > Free Disabled for TIM: 0
> > > > Free Disabled for DPI: 0
> > > > Free Disabled for AURA: 0
> > > > Alloc Disabled for Resvd: 0
> > > > NPA_AF_ERR:
> > > > Memory Fault on NPA_AQ_INST_S read: 0
> > > > Memory Fault on NPA_AQ_RES_S write: 0
> > > > AQ Doorbell Error: 0
> > > > Poisoned data on NPA_AQ_INST_S read: 0
> > > > Poisoned data on NPA_AQ_RES_S write: 0
> > > > Poisoned data on HW context read: 0
> > > > NPA_AF_RVU:
> > > > Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, the same was tested using the devlink health test interface by
> > injecting errors.
> > The dump gets generated automatically, and the counters do get out of
> > sync in case of continuous errors.
> > That wouldn't be much of an issue, as the user could manually trigger a
> > dump clear and re-dump the counters to get the exact status of the
> > counters at any point in time.
>
> Now that the recover op is added, the devlink error counter and recover
> counter will be proper. The internal counter for each event is needed just to
> understand, within a specific reporter, how many such events occurred.
>
> Following is the log snippet of the devlink health test being done on the
> hw_nix reporter.
> # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> //Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and NIX_AF_GENERAL errors)
> # devlink health
> pci/0002:01:00.0:
> reporter hw_npa
> state healthy error 0 recover 0 grace_period 0 auto_recover true
> auto_dump true
> reporter hw_nix
> state healthy error 250 recover 250 last_dump_date 1970-01-01
> last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
Oops, there was a log copy-paste error above. It's not 250 (that was from a run
in which the test was done with 250 error injections). The output for this run:
# devlink health
pci/0002:01:00.0:
reporter hw_npa
state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
reporter hw_nix
state healthy error 33 recover 33 last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
> Memory Fault on NIX_AQ_INST_S read: 1
> Memory Fault on NIX_AQ_RES_S write: 1
> AQ Doorbell error: 1
> Rx on unmapped PF_FUNC: 1
> Rx multicast replication error: 1
> Memory fault on NIX_RX_MCE_S read: 1
> Memory fault on multicast WQE read: 1
> Memory fault on mirror WQE read: 1
> Memory fault on mirror pkt write: 1
> Memory fault on multicast pkt write: 1
> NIX_AF_RAS:
> Poisoned data on NIX_AQ_INST_S read: 1
> Poisoned data on NIX_AQ_RES_S write: 1
> Poisoned data on HW context read: 1
> Poisoned data on packet read from mirror buffer: 1
> Poisoned data on packet read from mcast buffer: 1
> Poisoned data on WQE read from mirror buffer: 1
> Poisoned data on WQE read from multicast buffer: 1
> Poisoned data on NIX_RX_MCE_S read: 1
> NIX_AF_RVU:
> Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
> Memory Fault on NIX_AQ_INST_S read: 17
> Memory Fault on NIX_AQ_RES_S write: 17
> AQ Doorbell error: 17
> Rx on unmapped PF_FUNC: 17
> Rx multicast replication error: 17
> Memory fault on NIX_RX_MCE_S read: 17
> Memory fault on multicast WQE read: 17
> Memory fault on mirror WQE read: 17
> Memory fault on mirror pkt write: 17
> Memory fault on multicast pkt write: 17
> NIX_AF_RAS:
> Poisoned data on NIX_AQ_INST_S read: 17
> Poisoned data on NIX_AQ_RES_S write: 17
> Poisoned data on HW context read: 17
> Poisoned data on packet read from mirror buffer: 17
> Poisoned data on packet read from mcast buffer: 17
> Poisoned data on WQE read from mirror buffer: 17
> Poisoned data on WQE read from multicast buffer: 17
> Poisoned data on NIX_RX_MCE_S read: 17
> NIX_AF_RVU:
> Unmap Slot Error: 16
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump is something which can be turned off by the user.
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
> That way the user can dump whenever required, and the dump will return the
> correct counter values.
>
> >
> > Regards,
> > -George
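
For anyone following the numbers in the log above, here is a rough sketch of why the first dump shows 1/1/0 while the dump taken after "dump clear" shows 17/17/16. It is a toy Python model, not the kernel or driver code. The names Reporter, report, dump_show, dump_clear and inject are made up for illustration. It assumes that only one dump is stored at a time, that "dump show" takes a fresh dump when none is stored, and that odd-numbered test injections raise the GENERAL and RAS groups while even-numbered ones raise RVU (inferred from the 17/16 split in the log, with one counter per group for brevity).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Reporter:
    """Toy model of one health reporter (hypothetical, not the kernel code)."""
    auto_dump: bool = True
    # Driver-side software counters, one per event group for brevity.
    counters: Dict[str, int] = field(
        default_factory=lambda: {"GENERAL": 0, "RAS": 0, "RVU": 0})
    stored_dump: Optional[Dict[str, int]] = None
    error: int = 0

    def report(self, groups: Tuple[str, ...]) -> None:
        """Record one reported error that touches the given event groups."""
        for group in groups:
            self.counters[group] += 1
        self.error += 1
        # Assumption: a single dump is kept, so capture only if none exists.
        if self.auto_dump and self.stored_dump is None:
            self.stored_dump = dict(self.counters)

    def dump_show(self) -> Dict[str, int]:
        """Return the stored dump, taking a fresh one if none is stored."""
        if self.stored_dump is None:
            self.stored_dump = dict(self.counters)
        return dict(self.stored_dump)

    def dump_clear(self) -> None:
        """Drop the stored dump; the live counters are left untouched."""
        self.stored_dump = None


def inject(rep: Reporter, n: int) -> None:
    """Assumed test pattern: odd injections hit GENERAL+RAS, even hit RVU."""
    for i in range(1, n + 1):
        rep.report(("GENERAL", "RAS") if i % 2 else ("RVU",))


if __name__ == "__main__":
    rep = Reporter()
    inject(rep, 33)
    print(rep.error, rep.dump_show())   # stale snapshot from the first error
    rep.dump_clear()
    print(rep.dump_show())              # fresh snapshot of the live counters
```

Under these assumptions the stored dump lags the live counters until it is cleared, which matches the clear-and-re-dump workaround described above. With auto_dump set to false in the model, the first manual dump already reflects the live counters.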