Message-ID: <20230710173858.75bc590e@kernel.org>
Date: Mon, 10 Jul 2023 17:38:58 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Michael Chan <michael.chan@...adcom.com>
Cc: davem@...emloft.net, netdev@...r.kernel.org, edumazet@...gle.com,
 pabeni@...hat.com
Subject: Re: [PATCH net-next 0/3] eth: bnxt: handle invalid Tx completions
 more gracefully

On Mon, 10 Jul 2023 14:44:31 -0700 Michael Chan wrote:
> > bnxt trusts the events generated by the device which may lead to kernel
> > crashes. These are extremely rare but they do happen. For a while
> > I thought crashing may be intentional, because device reporting invalid
> > completions should never happen, and having a core dump could be useful
> > if it does. But in practice I haven't found any clues in the core dumps,
> > and panic_on_warn exists.  
> 
> Indeed, it was intentional to crash the kernel so that we could
> analyze the rings in the core dump.  Typically, we would find a bad
> completion in one of the rings and we would debug it with the hardware
> team during early chip testing.  Either the bug is fixed or some
> suitable workaround is implemented.  Ideally, this should never happen
> once the chip goes into production.

I was suspecting bad HW, but some new platforms seem to be hitting it,
too. Which now makes me suspect a PXE -> Linux hand-off problem?
Or multi-host?  Hard to tell..
Hopefully once it's not crashing it will be easier to do more analysis -
crashes within softirq during boot don't propagate too well into
monitoring systems :(

> I suppose in a large enough deployment, this NULL SKB crash can
> happen.  I will review your patchset later today.  Thanks.

Thanks!
