Message-ID: <20200509164229.GJ11244@42.do-not-panic.com>
Date:   Sat, 9 May 2020 16:42:29 +0000
From:   Luis Chamberlain <mcgrof@...nel.org>
To:     Igor Russkikh <irusskikh@...vell.com>
Cc:     jeyu@...nel.org, akpm@...ux-foundation.org, arnd@...db.de,
        rostedt@...dmis.org, mingo@...hat.com, aquini@...hat.com,
        cai@....pw, dyoung@...hat.com, bhe@...hat.com,
        peterz@...radead.org, tglx@...utronix.de, gpiccoli@...onical.com,
        pmladek@...e.com, tiwai@...e.de, schlad@...e.de,
        andriy.shevchenko@...ux.intel.com, keescook@...omium.org,
        daniel.vetter@...ll.ch, will@...nel.org,
        mchehab+samsung@...nel.org, kvalo@...eaurora.org,
        davem@...emloft.net, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, Ariel Elior <aelior@...vell.com>,
        GR-everest-linux-l2 <GR-everest-linux-l2@...vell.com>
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

On Sat, May 09, 2020 at 09:32:51AM +0300, Igor Russkikh wrote:
> 
> > This makes use of the new module_firmware_crashed() to help
> > annotate when firmware for device drivers crash. When firmware
> > crashes devices can sometimes become unresponsive, and recovery
> > sometimes requires a driver unload / reload and in the worst cases
> > a reboot.
> > 
> > Using a taint flag allows us to annotate when this happens clearly.
> > 
> > Cc: Ariel Elior <aelior@...vell.com>
> > Cc: GR-everest-linux-l2@...vell.com
> > Signed-off-by: Luis Chamberlain <mcgrof@...nel.org>
> > ---
> >  drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > index f4eebaabb6d0..9cc6287b889b 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > @@ -7854,6 +7854,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
> >  						 REGDUMP_HEADER_SIZE,
> >  						 &feature_size);
> >  		if (!rc) {
> > +			module_firmware_crashed();
> >  			*(u32 *)((u8 *)buffer + offset) =
> >  			    qed_calc_regdump_header(cdev, PROTECTION_OVERRIDE,
> >  						    cur_engine,
> > @@ -7869,6 +7870,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
> >  		rc = qed_dbg_fw_asserts(cdev, (u8 *)buffer + offset +
> >  					REGDUMP_HEADER_SIZE, &feature_size);
> >  		if (!rc) {
> > +			module_firmware_crashed();
> >  			*(u32 *)((u8 *)buffer + offset) =
> >  			    qed_calc_regdump_header(cdev, FW_ASSERTS,
> >  						    cur_engine, feature_size,
> > @@ -7906,6 +7908,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
> >  		rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
> >  				 REGDUMP_HEADER_SIZE, &feature_size);
> >  		if (!rc) {
> > +			module_firmware_crashed();
> >  			*(u32 *)((u8 *)buffer + offset) =
> >  			    qed_calc_regdump_header(cdev, GRC_DUMP,
> >  						    cur_engine,
> 
> 
> Hi Luis,
> 
> qed_dbg_all_data is being used to gather debug dump from device. Failures
> inside it may happen due to various reasons, but they normally do not indicate
> FW failure.
> 
> So I think it's not a good place to insert this call.
> It's hard to find an exact good place to insert it in qed.

Is there a way to check if what happened was indeed a fw crash?

> One more thing is that AFAIU taint flag gets permanent on kernel, but for
> example our device can recover itself from some FW crashes, thus it'd be
> transparent for user.

Similar things are *supposed* to be recoverable with other devices. However,
this can also sometimes lead to a situation where devices are no longer
usable and require a full driver unload / load.

> What's the logical purpose of module_firmware_crashed? Does it mean fatal
> unrecoverable error on device?

It's just to annotate on the module and kernel that this has happened.

I take it you may agree that firmware crashing *often* is not good design,
and these issues should be reported to / fixed by vendors. In cases
where driver bugs are reported, it is good to see if a firmware crash has
happened before, so that it can be ruled out during analysis.

  Luis
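
For illustration only, here is a minimal user-space sketch of the distinction
discussed above: the annotation is sticky, and it is recorded only when the
dump is collected because of a firmware assert, not merely because a dump
helper returned success. Every name here (fw_crash_tainted,
model_firmware_crashed, dump_reason, collect_dump) is hypothetical and is
not part of qed or of the proposed kernel API. How qed could actually tell a
real firmware crash apart from other dump triggers is the open question in
this thread and is simply assumed here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a sticky taint flag; it is never cleared. */
static bool fw_crash_tainted;

/* Hypothetical user-space model of module_firmware_crashed(). */
static void model_firmware_crashed(void)
{
	fw_crash_tainted = true;
}

/* Hypothetical reasons a debug dump might be collected. */
enum dump_reason {
	DUMP_USER_REQUEST,	/* e.g. an administrator asked for a dump */
	DUMP_FW_ASSERT,		/* assumed: the driver knows firmware asserted */
};

/*
 * Model of a dump helper. Returns 0 on success, like the helpers in the
 * quoted diff. The annotation depends on why the dump was taken, not on
 * whether collecting it succeeded.
 */
static int collect_dump(enum dump_reason reason)
{
	if (reason == DUMP_FW_ASSERT)
		model_firmware_crashed();

	return 0;
}
```

With this model, calling collect_dump(DUMP_USER_REQUEST) leaves the flag
clear, while a single collect_dump(DUMP_FW_ASSERT) sets it and it stays set
afterwards, which mirrors the concern that the taint is permanent even if
the device later recovers.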