Message-ID: <20200511141113.GP11244@42.do-not-panic.com>
Date: Mon, 11 May 2020 14:11:13 +0000
From: Luis Chamberlain <mcgrof@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Jiri Pirko <jiri@...nulli.us>, jeyu@...nel.org,
akpm@...ux-foundation.org, arnd@...db.de, rostedt@...dmis.org,
mingo@...hat.com, aquini@...hat.com, cai@....pw, dyoung@...hat.com,
bhe@...hat.com, peterz@...radead.org, tglx@...utronix.de,
gpiccoli@...onical.com, pmladek@...e.com, tiwai@...e.de,
schlad@...e.de, andriy.shevchenko@...ux.intel.com,
keescook@...omium.org, daniel.vetter@...ll.ch, will@...nel.org,
mchehab+samsung@...nel.org, kvalo@...eaurora.org,
davem@...emloft.net, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/15] net: taint when the device driver firmware crashes

On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then study the code to see if this is indeed a place
> > where the firmware crashes. In some places this is quite obvious.
> >
> > I'm starting off with networking first; if this gets merged, later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsystems.
> >
> > Review, flames, etc are greatly appreciated.
>
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
>
> With devlink health we get (a) a standard notification on the failure;
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).

It indeed sounds very useful!

> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?

Yes indeed, I missed clarifying this in the documentation. I'll do so
in my next respin.

> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang),
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint?

It's up to the maintainers of the device driver. What I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal / addition
to get things back in order again.
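
To illustrate the userspace side of this, here is a minimal sketch of
checking a single taint bit from /proc/sys/kernel/tainted rather than
scraping the kernel log. The function names are made up for this example,
and the bit number is deliberately left as a parameter: which bit a
firmware-crash taint would use is an assumption that depends on what the
series ends up assigning.

```python
from pathlib import Path

# The kernel exposes the taint state as a decimal bitmask in this file.
TAINT_PATH = Path("/proc/sys/kernel/tainted")


def parse_taint_mask(text: str) -> int:
    """Parse the contents of the taint file into an integer bitmask."""
    return int(text.strip())


def taint_bit_set(mask: int, bit: int) -> bool:
    """Return True if the given bit number is set in the taint bitmask."""
    return bool((mask >> bit) & 1)


def read_taint_mask(path: Path = TAINT_PATH) -> int:
    """Read and parse the current taint bitmask from the running kernel."""
    return parse_taint_mask(path.read_text())
```

A monitoring script could then call taint_bit_set(read_taint_mask(), bit)
with whatever bit number the firmware-crash taint is given, and act on the
result without knowing anything about driver-specific log messages.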

  Luis