lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 5 Nov 2020 16:23:57 -0800 From: Jakub Kicinski <kuba@...nel.org> To: Saeed Mahameed <saeed@...nel.org> Cc: George Cherian <gcherian@...vell.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Jiri Pirko <jiri@...dia.com>, "davem@...emloft.net" <davem@...emloft.net>, Sunil Kovvuri Goutham <sgoutham@...vell.com>, Linu Cherian <lcherian@...vell.com>, Geethasowjanya Akula <gakula@...vell.com>, "masahiroy@...nel.org" <masahiroy@...nel.org>, "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com> Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX On Thu, 05 Nov 2020 15:52:32 -0800 Saeed Mahameed wrote: > On Thu, 2020-11-05 at 12:42 -0800, Jakub Kicinski wrote: > > On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote: > > > If you report an error without recovering, devlink health will > > > report a > > > bad device state > > > > > > $ ./devlink health > > > pci/0002:01:00.0: > > > reporter npa > > > state error error 1 recover 0 > > > > Actually, the counter in the driver is unnecessary, right? Devlink > > counts errors. > > if you mean error and recover counters, then yes. they are managed by > devlink health > > every call to dl-health-report will do: > > devlink_health_report(reporter, err_ctx, msg) > { > reproter.error++; > > devlink_trigger_event(reporter, msg); > > reporter.dump(err_ctx, msg); > reporter.diag(err_ctx); > > if (!reporter.recover(err_ctx)) > reporter.recover++; > } > > so dl-health reports without a recover op will confuse the user if user > sees error count > recover count. > > error count should only be grater than recover count when recover > procedure fails which now will indicate the device is not in a healthy > state. Good point, as is the internal devlink counter mismatch looks pretty strange. > also i want to clarify one small note about devlink dump. > > devlink health dump semantics: > on devlink health dump, the devlink health will check if previous dump > exists and will just return it without actually calling the driver, if > not then it will call the driver to perform a new dump and will cache > it. > > user has to explicitly clear the devlink health dump of that reporter > in order to allow for newer dump to get generated. > > this is done this way because we want the driver to store the dump of > the previously reported errors at the moment the erorrs are reported by > driver, so when a user issue a dump command the dump of the previous > error will be reported to user form memory without the need to access > driver/hw who might be in a bad state. > > so this is why using devlink dump for reporting counters doesn't really > work, it will only report the first time the counters are accessed via > devlink health dump, after that it will report the same cached values > over and over until the user clears it up. Agreed, if only counters are reported driver should rely on the devlink counters. Dump contents are for context of the event. > > > So you will need to implement an empty recover op. > > > so if these events are informational only and they don't indicate > > > device health issues, why would you report them via devlink health > > > ? > > > > I see devlink health reporters a way of collecting errors reports > > which > > for the most part are just shared with the vendor. IOW firmware (or > > hardware) bugs. > > > > Obviously as you say without recover and additional context in the > > report the value is quite diminished. But _if_ these are indeed > > "report > > me to the vendor" kind of events then at least they should use our > > current mechanics for such reports - which is dl-health. > > > > Without knowing what these events are it's quite hard to tell if > > devlink health is an overkill or counter is sufficient. > > > > Either way - printing these to the logs is definitely the worst > > choice > > :) > > Sure, I don't mind using devlink health for dump only, I don't really > have strong feelings against this, they can always extend it in the > future. > > it just doesn't make sense to me to have it mainly used for dumping > counters and without using devlik helath utilities, like events, > reports and recover. > > so maybe Sunil et al. could polish this patchset and provide more > devlink health support, like diagnose for these errors, dump HW > information and contexts related to these errors so they could debug > root causes, etc .. > Then the use for dl health in this series can be truly justified. That'd indeed be optimal.
Powered by blists - more mailing lists