Message-ID: <CAMGffE=7pMtOOo2W+TtY84U8F5EQ9f9jRMSDU9kT+4_MOF_dTg@mail.gmail.com>
Date: Thu, 17 Aug 2023 09:25:02 +0200
From: Jinpu Wang <jinpu.wang@...os.com>
To: Michael Chan <michael.chan@...adcom.com>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev <netdev@...r.kernel.org>
Subject: Re: [RFC] bnxt_en TX timeout detected, starting reset task, flapping
 link after

Hi Michael, hi Jakub,

Thx for the help.

On Thu, Aug 17, 2023 at 9:08 AM Michael Chan <michael.chan@...adcom.com> wrote:
>
> On Wed, Aug 16, 2023 at 8:01 PM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > On Wed, 16 Aug 2023 20:51:25 +0200 Jinpu Wang wrote:
> > > Hi Michael, and folks on the list.
> >
> > It seems you meant to CC Michael.. adding him now.
> > I don't recall anything like this. Could be a bad system...
>
> I agree that it could be a bad NIC or a bad system.
Surprisingly, we saw the same symptom on two servers, one after the
other. We suspect it is workload specific: once we migrated some
workload from the first problematic server to the second one, the
second server hit the same problem 4 hours later. After we disabled
some offloads via ethtool, the system became stable again.


>
> >
> > > kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251006] bnxt_en
> > > 0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
>
> TX ring 0 is timing out with prod ahead of cons.
>
> > > kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251015] bnxt_en
> > > 0000:45:00.0 eth0: [2]: tx{fw_ring: 2 prod: af cons: 9b}
>
> TX ring 2 is also timing out.
>
> > > kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251022] bnxt_en
> > > 0000:45:00.0 eth0: [4]: tx{fw_ring: 4 prod: d4 cons: d2}
>
> Same for TX ring 4.
>
> > > kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251027] bnxt_en
> > > 0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
>
> TX ring 6 is ahead by a lot.
>
> > > kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.874938] bnxt_en
> > > 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> > > kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.884991] bnxt_en
> > > 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
>
> This means that during reset, we're timing out when trying to free the
> TX ring (type 1).  There are exactly 4 of these type 1 ring free
> errors, probably matching the 4 TX rings that timed out.  There are
> also 7 type 2 (RX ring) errors.  This makes some sense because by
> default there are usually 2 RX rings sharing the same MSIX with 1 TX
> ring.  So 7 out of 8 RX rings associated with the TX rings also failed
> to be freed.
>
> > >
> > > I checked git history, but can't find any bugfix related to it. The
> > > internet tells me it could be a
> > > firmware bug, but I can't find firmware from Broadcom site or supermicro site.
> > >
>
> I will have someone reach out to you to help with newer firmware.
That would be great.

Could Broadcom also add the bnxt_en firmware to linux-firmware, as was
done for bnx2? That would make life easier for people like me.

Thx again.

  Thanks.
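
For anyone comparing similar dumps, here is a minimal sketch (not part
of the driver or of any Broadcom tool) that pulls the per-ring prod/cons
values out of the "tx{fw_ring: ... prod: ... cons: ...}" lines quoted
above and reports how far prod is ahead of cons. The values are read as
hexadecimal, as in the analysis above. The ring mask of 0x1ff (a
512-entry TX ring) is an assumption made only so that wrapped indexes
such as ring 6 give a sensible distance; the real ring size on a given
system may differ. The names parse_tx_rings, pending and RING_MASK are
made up for this example.

```python
import re

# Matches the per-ring TX dump lines quoted in this thread, e.g.
# "eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}"
TX_RE = re.compile(
    r"\[(\d+)\]: tx\{fw_ring: \w+ prod: ([0-9a-f]+) cons: ([0-9a-f]+)\}"
)

# ASSUMPTION: 512-entry TX ring, so indexes wrap at 0x1ff.
# Adjust this to the actual ring size of the NIC being examined.
RING_MASK = 0x1FF


def pending(prod, cons, mask=RING_MASK):
    """Distance from cons to prod, accounting for index wraparound."""
    return (prod - cons) & mask


def parse_tx_rings(log_text, mask=RING_MASK):
    """Return {ring_index: (prod, cons, pending)} for every TX ring dump line."""
    # Log lines may be wrapped (as in the mail), so collapse whitespace first.
    flat = " ".join(log_text.split())
    rings = {}
    for m in TX_RE.finditer(flat):
        idx = int(m.group(1))
        prod = int(m.group(2), 16)  # prod/cons are printed in hex
        cons = int(m.group(3), 16)
        rings[idx] = (prod, cons, pending(prod, cons, mask))
    return rings


# The four ring dumps from the log in this thread.
SAMPLE = """
bnxt_en 0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
bnxt_en 0000:45:00.0 eth0: [2]: tx{fw_ring: 2 prod: af cons: 9b}
bnxt_en 0000:45:00.0 eth0: [4]: tx{fw_ring: 4 prod: d4 cons: d2}
bnxt_en 0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
"""

if __name__ == "__main__":
    for idx, (prod, cons, ahead) in sorted(parse_tx_rings(SAMPLE).items()):
        print(f"ring {idx}: prod={prod:#x} cons={cons:#x} ahead={ahead}")
```

Under the assumed mask this gives 3, 20, 2 and 323 entries for rings 0,
2, 4 and 6, which matches the reading that ring 6 is ahead by a lot.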
