linux-kernel - Re: [PATCH v4 2/6] spi: spi-fsl-dspi: Store status directly in cur

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250630204135.gzffv33j3pk3bgx6@skbuf>
Date: Mon, 30 Jun 2025 23:41:35 +0300
From: Vladimir Oltean <vladimir.oltean@....com>
To: James Clark <james.clark@...aro.org>
Cc: Vladimir Oltean <olteanv@...il.com>, Mark Brown <broonie@...nel.org>,
	Arnd Bergmann <arnd@...db.de>,
	Larisa Grigore <larisa.grigore@....com>,
	Frank Li <Frank.li@....com>, Christoph Hellwig <hch@....de>,
	linux-spi@...r.kernel.org, imx@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 2/6] spi: spi-fsl-dspi: Store status directly in
 cur_msg->status

On Mon, Jun 30, 2025 at 01:54:11PM +0100, James Clark wrote:
> On 27/06/2025 10:30 pm, Vladimir Oltean wrote:
> > On Fri, Jun 27, 2025 at 11:21:38AM +0100, James Clark wrote:
> > > This will allow us to return a status from the interrupt handler in a
> > > later commit and avoids copying it at the end of
> > > dspi_transfer_one_message(). For consistency make polling and DMA modes
> > > use the same mechanism.
> > > 
> > > Refactor dspi_rxtx() and dspi_poll() to not return -EINPROGRESS because
> > > this isn't actually a status that was ever returned to the core layer
> > > but some internal state. Wherever that was used we can look at dspi->len
> > > instead.
> > > 
> > > No functional changes intended.
> > > 
> > > Signed-off-by: James Clark <james.clark@...aro.org>
> > > ---
> > 
> > This commit doesn't work, please do not merge this patch.
> > 
> > You are changing the logic in DMA mode, interrupt-based FIFO and PIO all
> > in one go, in a commit whose title and primary purpose is unrelated to
> > that. Just a mention of the type "while at it, also do that". And in
> > that process, that bundled refactoring introduces a subtle, but severe bug.
> > 
> > No, that is discouraged. Make one patch per logical change, where only
> > one thing is happening and which is obviously correct. It helps you and
> > it helps the reviewer.
> > 
> > Please find attached a set of 3 patches that represent a broken down and
> > corrected variant of this one. First 2 should be squashed together in
> > your next submission, they are just to illustrate the bug that you've
> > introduced (which can be reproduced on any SoC in XSPI mode).
> > 
> 
> Thanks for the debugging, yes it looks like the patches could be broken down
> a bit.
> 
> Just for clarity, is this bug affecting host+polling mode? I can see the
> logic bug in dspi_poll() which I must have tested less thoroughly, but I
> can't actually see any difference in dspi_interrupt().

It should affect both, I tested your patches unmodified, i.e. interrupt
based XSPI FIFO mode (in master mode).

Assume (not real numbers, just for explanation's sake) dspi->len is 2
(2 FIFO sizes worth of 32-bit words, but let's assume for simplicity
that each dspi_pop_tx() call simply decrements the len by 1).

The correct behavior would be this:

dspi_transfer_one_message()
-> dspi->len = 2
-> dspi_fifo_write()
   -> dspi_xspi_fifo_write()
      -> dspi_pop_tx()
         -> dspi->len = 1
-> wait_for_completion(&dspi->xfer_done)
                                           <IRQ>
                                           dspi_interrupt()
                                           -> dspi_rxtx()
                                              -> dspi_fifo_read()
                                              -> dspi_fifo_write()
                                                 -> dspi_xspi_fifo_write()
                                                    -> dspi_pop_tx()
                                                       -> dspi->len = 0
                                           <IRQ>
                                           dspi_interrupt()
                                           -> dspi_rxtx()
                                              -> dspi_fifo_read()
                                           -> complete(&dspi->xfer_done)
-> reinit_completion(&dspi->xfer_done)

but the behavior with your proposed logic is this:

dspi_transfer_one_message()
-> dspi->len = 2
-> dspi_fifo_write()
   -> dspi_xspi_fifo_write()
      -> dspi_pop_tx()
         -> dspi->len = 1
-> wait_for_completion(&dspi->xfer_done)
                                           <IRQ>
                                           dspi_interrupt()
                                           -> dspi_rxtx()
                                              -> dspi_fifo_read()
                                              -> dspi_fifo_write()
                                                 -> dspi_xspi_fifo_write()
                                                    -> dspi_pop_tx()
                                                       -> dspi->len = 0
                                           -> complete(&dspi->xfer_done)
-> reinit_completion(&dspi->xfer_done)
                                           <IRQ>
                                           dspi_interrupt()
                                           -> Second interrupt is spurious at
                                              this point, since the process
                                              context may have proceeded
                                              to change pointers in
                                              dspi->cur_transfer, etc.

Clearer now? Essentially the complete() call is premature, it needs to
be not after the dspi_fifo_write() call, but after its subsequent
dspi_fifo_read(), which comes after yet another IRQ, in the IRQ-triggered
path.

Not sure why you are not able to reproduce this, maybe luck had it that
the complete() call never woke up the process context earlier than the
second IRQ in the above case triggered?

I'm not doing anything special in particular, just booted a board with a
SPI device driver (sja1105). This transfers some sequences of relatively
large buffers (256 bytes) at probe time, maybe that exercises the
controller driver more than the average peripheral driver.