netdev - Re: [EXT] [PATCH 09/15] qed: use new module_firmware

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200512173431.GD11244@42.do-not-panic.com>
Date:   Tue, 12 May 2020 17:34:31 +0000
From:   Luis Chamberlain <mcgrof@...nel.org>
To:     Igor Russkikh <irusskikh@...vell.com>
Cc:     jeyu@...nel.org, akpm@...ux-foundation.org, arnd@...db.de,
        rostedt@...dmis.org, mingo@...hat.com, aquini@...hat.com,
        cai@....pw, dyoung@...hat.com, bhe@...hat.com,
        peterz@...radead.org, tglx@...utronix.de, gpiccoli@...onical.com,
        pmladek@...e.com, tiwai@...e.de, schlad@...e.de,
        andriy.shevchenko@...ux.intel.com, keescook@...omium.org,
        daniel.vetter@...ll.ch, will@...nel.org,
        mchehab+samsung@...nel.org, kvalo@...eaurora.org,
        davem@...emloft.net, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, Ariel Elior <aelior@...vell.com>,
        GR-everest-linux-l2 <GR-everest-linux-l2@...vell.com>
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

On Tue, May 12, 2020 at 07:23:28PM +0300, Igor Russkikh wrote:
> 
> >> So I think its not a good place to insert this call.
> >> Its hard to find exact good place to insert it in qed.
> > 
> > Is there a way to check if what happened was indeed a fw crash?
> 
> Our driver has two firmwares (slowpath and fastpath).
> For slowpath firmware the way to understand it crashed is to observe command
> response timeout. This is in qed_mcp.c, around "The MFW failed to respond to
> command" traceout.

Ok thanks.

> For fastpath this is tricky, think you may leave the above place as the only
> place to invoke module_firmware_crashed()

So do you mean like the changes below?

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index f4eebaabb6d0..95cb7da2542e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7906,6 +7906,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 		rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
 				 REGDUMP_HEADER_SIZE, &feature_size);
 		if (!rc) {
+			module_firmware_crashed();
 			*(u32 *)((u8 *)buffer + offset) =
 			    qed_calc_regdump_header(cdev, GRC_DUMP,
 						    cur_engine,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 280527cc0578..a818cf09dccf 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -566,6 +566,7 @@ _qed_mcp_cmd_and_union(struct qed_hwfn *p_hwfn,
 		DP_NOTICE(p_hwfn,
 			  "The MFW failed to respond to command 0x%08x [param 0x%08x].\n",
 			  p_mb_params->cmd, p_mb_params->param);
+		module_firmware_crashed();
 		qed_mcp_print_cpu_info(p_hwfn, p_ptt);
 
 		spin_lock_bh(&p_hwfn->mcp_info->cmd_lock);

> >> One more thing is that AFAIU taint flag gets permanent on kernel, but
> > for
> >> example our device can recover itself from some FW crashes, thus it'd be
> >> transparent for user.
> > 
> > Similar things are *supposed* to recoverable with other device, however
> > this can also sometimes lead to a situation where devices are not usable
> > anymore, and require a full driver unload / load.
> > 
> >> Whats the logical purpose of module_firmware_crashed? Does it mean fatal
> >> unrecoverable error on device?
> > 
> > Its just to annotate on the module and kernel that this has happened.
> > 
> > I take it you may agree that, firmware crashing *often* is not good
> > design,
> > and these issues should be reported to / fixed by vendors. In cases
> > where driver bugs are reported it is good to see if a firmware crash has
> > happened before, so that during analysis this is ruled out.
> 
> Probably, but still I see some misalignment here, in sense that taint is about
> the kernel state, not about a hardware state indication.

The kernel carries the driver though, and the driver / subsystem can
often times act strange when this happens.

> devlink health could really be a much better candidate for such things.

That sounds fantastic, please Cc me on patches! However I still believe
we should register this event in the kernel for support purposes.

  Luis