Date: Tue, 1 Dec 2020 05:23:23 +0000
From: George Cherian <gcherian@...vell.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
    "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
    "davem@...emloft.net" <davem@...emloft.net>,
    Sunil Kovvuri Goutham <sgoutham@...vell.com>,
    Linu Cherian <lcherian@...vell.com>,
    "Geethasowjanya Akula" <gakula@...vell.com>,
    "masahiroy@...nel.org" <masahiroy@...nel.org>,
    "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
    "saeed@...nel.org" <saeed@...nel.org>,
    "jiri@...nulli.us" <jiri@...nulli.us>
Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA

> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <kuba@...nel.org>
> Cc: 'netdev@...r.kernel.org' <netdev@...r.kernel.org>;
> 'linux-kernel@...r.kernel.org' <linux-kernel@...r.kernel.org>;
> 'davem@...emloft.net' <davem@...emloft.net>; Sunil Kovvuri Goutham
> <sgoutham@...vell.com>; Linu Cherian <lcherian@...vell.com>;
> Geethasowjanya Akula <gakula@...vell.com>; 'masahiroy@...nel.org'
> <masahiroy@...nel.org>; 'willemdebruijn.kernel@...il.com'
> <willemdebruijn.kernel@...il.com>; 'saeed@...nel.org'
> <saeed@...nel.org>; 'jiri@...nulli.us' <jiri@...nulli.us>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
>
> Jakub,
>
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <kuba@...nel.org>
> > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > davem@...emloft.net; Sunil Kovvuri Goutham <sgoutham@...vell.com>;
> > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > <gakula@...vell.com>; masahiroy@...nel.org;
> > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@...nel.org>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcherian@...vell.com>
> > > Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> > > davem@...emloft.net; Sunil Kovvuri Goutham <sgoutham@...vell.com>;
> > > Linu Cherian <lcherian@...vell.com>; Geethasowjanya Akula
> > > <gakula@...vell.com>; masahiroy@...nel.org;
> > > willemdebruijn.kernel@...il.com; saeed@...nel.org; jiri@...nulli.us
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > >  - GENERAL events
> > > >  - ERROR events
> > > >  - RAS events
> > > >  - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > >  # devlink health
> > > >  pci/0002:01:00.0:
> > > >    reporter hw_npa
> > > >      state healthy error 0 recover 0
> > > >  # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > > >  NPA_AF_GENERAL:
> > > >    Unmap PF Error: 0
> > > >    NIX:
> > > >      0: free disabled RX: 0 free disabled TX: 0
> > > >      1: free disabled RX: 0 free disabled TX: 0
> > > >    Free Disabled for SSO: 0
> > > >    Free Disabled for TIM: 0
> > > >    Free Disabled for DPI: 0
> > > >    Free Disabled for AURA: 0
> > > >    Alloc Disabled for Resvd: 0
> > > >  NPA_AF_ERR:
> > > >    Memory Fault on NPA_AQ_INST_S read: 0
> > > >    Memory Fault on NPA_AQ_RES_S write: 0
> > > >    AQ Doorbell Error: 0
> > > >    Poisoned data on NPA_AQ_INST_S read: 0
> > > >    Poisoned data on NPA_AQ_RES_S write: 0
> > > >    Poisoned data on HW context read: 0
> > > >  NPA_AF_RVU:
> > > >    Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, the same was tested using devlink health test interface by
> > injecting errors.
> > The dump gets generated automatically and the counters do get out of
> > sync, in case of continuous error.
> > That wouldn't be much of an issue as the user could manually trigger a
> > dump clear and Re-dump the counters to get the exact status of the
> > counters at any point of time.
>
> Now that recover op is added the devlink error counter and recover counter
> will be proper. The internal counter for each event is needed just to
> understand within a specific reporter, how many such events occurred.
>
> Following is the log snippet of the devlink health test being done on hw_nix
> reporter.
> # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> //Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and
> NIX_AF_GENERAL errors)
> # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true
>     auto_dump true
>   reporter hw_nix
>     state healthy error 250 recover 250 last_dump_date 1970-01-01
>     last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true

Oops, there was a log copy-paste error above: it's not 250 (that was from
a run in which the test was done with 250 error injections).

# devlink health
pci/0002:01:00.0:
  reporter hw_npa
    state healthy error 0 recover 0 grace_period 0 auto_recover true
    auto_dump true
  reporter hw_nix
    state healthy error 33 recover 33 last_dump_date 1970-01-01
    last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>   Memory Fault on NIX_AQ_INST_S read: 1
>   Memory Fault on NIX_AQ_RES_S write: 1
>   AQ Doorbell error: 1
>   Rx on unmapped PF_FUNC: 1
>   Rx multicast replication error: 1
>   Memory fault on NIX_RX_MCE_S read: 1
>   Memory fault on multicast WQE read: 1
>   Memory fault on mirror WQE read: 1
>   Memory fault on mirror pkt write: 1
>   Memory fault on multicast pkt write: 1
> NIX_AF_RAS:
>   Poisoned data on NIX_AQ_INST_S read: 1
>   Poisoned data on NIX_AQ_RES_S write: 1
>   Poisoned data on HW context read: 1
>   Poisoned data on packet read from mirror buffer: 1
>   Poisoned data on packet read from mcast buffer: 1
>   Poisoned data on WQE read from mirror buffer: 1
>   Poisoned data on WQE read from multicast buffer: 1
>   Poisoned data on NIX_RX_MCE_S read: 1
> NIX_AF_RVU:
>   Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>   Memory Fault on NIX_AQ_INST_S read: 17
>   Memory Fault on NIX_AQ_RES_S write: 17
>   AQ Doorbell error: 17
>   Rx on unmapped PF_FUNC: 17
>   Rx multicast replication error: 17
>   Memory fault on NIX_RX_MCE_S read: 17
>   Memory fault on multicast WQE read: 17
>   Memory fault on mirror WQE read: 17
>   Memory fault on mirror pkt write: 17
>   Memory fault on multicast pkt write: 17
> NIX_AF_RAS:
>   Poisoned data on NIX_AQ_INST_S read: 17
>   Poisoned data on NIX_AQ_RES_S write: 17
>   Poisoned data on HW context read: 17
>   Poisoned data on packet read from mirror buffer: 17
>   Poisoned data on packet read from mcast buffer: 17
>   Poisoned data on WQE read from mirror buffer: 17
>   Poisoned data on WQE read from multicast buffer: 17
>   Poisoned data on NIX_RX_MCE_S read: 17
> NIX_AF_RVU:
>   Unmap Slot Error: 16
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump is something which can be turned off by user.
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
> So that user can dump whenever required, which will always return the
> correct counter values.
>
> >
> > Regards,
> > -George
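
For readers following the thread, the stale-dump effect in the log above can be reproduced with a toy model. This is only a sketch in Python, not the kernel's devlink code: the names `ToyReporter`, `report`, `dump_show` and `dump_clear` are hypothetical, the three counters stand in for whole event groups, and the alternating injection pattern (odd test runs hit GENERAL and RAS, even runs hit RVU) is an assumption chosen to match the 17/16 split reported in the log. It assumes the reporter keeps at most one stored dump until that dump is cleared.

```python
class ToyReporter:
    """Toy model of a health reporter that keeps at most one stored dump."""

    def __init__(self, auto_dump=True):
        self.auto_dump = auto_dump
        # Driver-side event counters; one per event group for brevity.
        self.counters = {"GENERAL": 0, "RAS": 0, "RVU": 0}
        self.error = 0           # models the "error" field of the reporter
        self.recover = 0         # models the "recover" field of the reporter
        self.stored_dump = None  # only one dump is kept at a time

    def report(self, groups):
        """Record one injected error touching the given event groups."""
        for name in groups:
            self.counters[name] += 1
        self.error += 1
        # A dump is taken only if none is stored yet (assumed behaviour).
        if self.auto_dump and self.stored_dump is None:
            self.stored_dump = dict(self.counters)
        self.recover += 1

    def dump_show(self):
        """Return the stored dump, taking a fresh one if none exists."""
        if self.stored_dump is None:
            self.stored_dump = dict(self.counters)
        return self.stored_dump

    def dump_clear(self):
        """Drop the stored dump so the next show takes a fresh snapshot."""
        self.stored_dump = None


reporter = ToyReporter()
for i in range(1, 34):  # 33 injected errors, as in the log
    # Assumption: odd runs hit GENERAL and RAS, even runs hit RVU.
    reporter.report(["GENERAL", "RAS"] if i % 2 else ["RVU"])

stale = reporter.dump_show()  # snapshot taken at the first error
reporter.dump_clear()
fresh = reporter.dump_show()  # snapshot of the current counters

print(reporter.error, reporter.recover)
print(stale)
print(fresh)
```

In this model the first `dump_show` returns the snapshot taken at the first injected error, and only the `dump_clear` followed by a second `dump_show` exposes the current counter values, which is the sequence shown in the log.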