Message-ID: <20230927183214.39c2986b@xps-13>
Date: Wed, 27 Sep 2023 18:32:30 +0200
From: Miquel Raynal <miquel.raynal@...tlin.com>
To: Marc Kleine-Budde <mkl@...gutronix.de>
Cc: Wolfgang Grandegger <wg@...ndegger.com>, "David S. Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
 <pabeni@...hat.com>, Eric Dumazet <edumazet@...gle.com>,
 netdev@...r.kernel.org, linux-can@...r.kernel.org, Jérémie
 Dautheribes <jeremie.dautheribes@...tlin.com>, Thomas Petazzoni
 <thomas.petazzoni@...tlin.com>, sylvain.girard@...com,
 pascal.eberhard@...com, stable@...r.kernel.org
Subject: Re: [PATCH net] can: sja1000: Always restart the Tx queue after an
 overrun

Hi Marc,

mkl@...gutronix.de wrote on Wed, 27 Sep 2023 11:33:32 +0200:

> On 27.09.2023 11:30:16, Marc Kleine-Budde wrote:
> > On 22.09.2023 17:47:27, Miquel Raynal wrote:  
> > > Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
> > > a soft reset on Renesas SoCs") fixes an issue with Renesas' own SJA1000
> > > CAN controller reception: the Rx buffer is only 5 messages long, so when
> > > the bus is loaded (e.g. a message every 50us), an overrun may easily
> > > happen. Upon an overrun, due to a possible internal crosstalk
> > > situation, the controller enters a frozen state which can only be
> > > unlocked with a soft reset (experimentally). The solution was to offload
> > > a call to sja1000_start() to a threaded handler. This needs to happen in
> > > process context as this operation needs to sleep. sja1000_start()
> > > basically enters "reset mode", performs a proper software reset and
> > > returns to "normal mode".
> > > 
> > > Since this fix was introduced, we no longer observe any stalls in
> > > reception. However it was sporadically observed that the transmit path
> > > would now freeze. Further investigation blamed the fix mentioned above,
> > > and especially the reset operation. Reproducing the reset in a loop
> > > helped identify what could possibly go wrong. The sja1000 is a single
> > > Tx queue device, which leverages the netdev helpers to process one Tx
> > > message at a time. The logic is: the queue is stopped, the message sent
> > > to the transceiver, once properly transmitted the controller sets a
> > > status bit which triggers an interrupt, in the interrupt handler the
> > > transmission status is checked and the queue woken up. Unfortunately, if
> > > an overrun happens, we might perform the soft reset precisely between
> > > the transmission of the buffer to the transceiver and the advent of the
> > > transmission status bit. We would then stop the transmission operation
> > > without re-enabling the queue, causing all further transmissions to
> > > be ignored.
> > > 
> > > The reset interrupt can only happen while the device is "open", and
> > > after a reset we anyway want to resume normal operations, no matter if a
> > > packet to transmit got dropped in the process, so we shall wake up the
> > > queue. Restarting the device and waking up the queue is exactly what
> > > sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
> > > the queue state, we must acquire a lock both in the reset handler and in
> > > the transmit path to ensure serialization of both operations. As the
> > > reset handler might still be called after the transmission of a frame to
> > > the transceiver but before it actually gets transmitted, we must ensure
> > > we don't leak the skb, so we free it (the behavior is consistent, no
> > > matter if there was an skb on the stack or not).  
> > 
> > Can you make use of netif_tx_disable() and netif_wake_queue() in
> > sja1000_reset_interrupt() instead of the lock?  
> 
> ...or netif_tx_lock()/netif_tx_unlock().

As that is also backed by a spinlock, I guess it would fit. From a quick
look I do not see any specific constraint on using it here, so let's go
for it.
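
To make the race and the proposed serialization concrete, here is a
rough userspace model in C. It is only a sketch and not the driver
code: every model_* name is hypothetical, and the atomic_flag spinlock
merely stands in for whatever netif_tx_lock()/netif_tx_unlock() take in
the kernel. The model's transmit function takes the lock itself. The
wake_queue argument selects between the stalling behaviour described in
the commit message (false) and "always restart the Tx queue" (true).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a single-Tx-queue device. All names here are
 * hypothetical; none of this is the sja1000 driver or the netdev API. */
struct model_dev {
	atomic_flag tx_lock;   /* stand-in for the Tx queue spinlock */
	bool queue_stopped;    /* true while one frame is being sent */
	bool frame_in_flight;  /* frame handed over, completion pending */
	unsigned int dropped;  /* frames discarded by a reset */
};

static void model_tx_lock(struct model_dev *d)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&d->tx_lock,
						 memory_order_acquire))
		;
}

static void model_tx_unlock(struct model_dev *d)
{
	atomic_flag_clear_explicit(&d->tx_lock, memory_order_release);
}

void model_init(struct model_dev *d)
{
	assert(d != NULL);
	atomic_flag_clear(&d->tx_lock);
	d->queue_stopped = false;
	d->frame_in_flight = false;
	d->dropped = 0;
}

/* Transmit path: stop the queue and hand one frame to the controller.
 * Returns false if the queue is stopped and the frame is refused. */
bool model_start_xmit(struct model_dev *d)
{
	bool accepted = false;

	model_tx_lock(d);
	if (!d->queue_stopped) {
		d->queue_stopped = true;
		d->frame_in_flight = true;
		accepted = true;
	}
	model_tx_unlock(d);
	return accepted;
}

/* Tx-done interrupt: wakes the queue, but only if the frame survived. */
void model_tx_done_irq(struct model_dev *d)
{
	model_tx_lock(d);
	if (d->frame_in_flight) {
		d->frame_in_flight = false;
		d->queue_stopped = false;
	}
	model_tx_unlock(d);
}

/* Soft reset after an overrun. Any pending frame is lost, so it is
 * counted as dropped (the real handler would free the skb). With
 * wake_queue set, the queue is restarted under the same lock. */
void model_reset(struct model_dev *d, bool wake_queue)
{
	model_tx_lock(d);
	if (d->frame_in_flight) {
		d->frame_in_flight = false;
		d->dropped++;
	}
	if (wake_queue)
		d->queue_stopped = false;
	model_tx_unlock(d);
}
```

Calling model_start_xmit(), then model_reset() with wake_queue false,
leaves the queue stopped for good, since the Tx-done interrupt never
fires for the lost frame. With wake_queue true, the next
model_start_xmit() is accepted again and the lost frame is accounted
for in dropped.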

Thanks,
Miquèl
