linux-kernel - Re: fsl_espi errors on v5.7.15

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <23d13439e4cc1872c29db2f93e715a61f4843943.camel@infinera.com>
Date:   Mon, 7 Sep 2020 09:53:26 +0000
From:   Joakim Tjernlund <Joakim.Tjernlund@...inera.com>
To:     "mpe@...erman.id.au" <mpe@...erman.id.au>,
        "broonie@...nel.org" <broonie@...nel.org>,
        "paulus@...ba.org" <paulus@...ba.org>,
        "npiggin@...il.com" <npiggin@...il.com>,
        "Chris.Packham@...iedtelesis.co.nz" 
        <Chris.Packham@...iedtelesis.co.nz>,
        "benh@...nel.crashing.org" <benh@...nel.crashing.org>,
        "hkallweit1@...il.com" <hkallweit1@...il.com>
CC:     "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "linux-spi@...r.kernel.org" <linux-spi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15

On Thu, 2020-09-03 at 23:58 +0000, Chris Packham wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> 
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> > Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> > > On 1/09/20 12:33 am, Heiner Kallweit wrote:
> > > > On 30.08.2020 23:59, Chris Packham wrote:
> > > > > On 31/08/20 9:41 am, Heiner Kallweit wrote:
> > > > > > On 30.08.2020 23:00, Chris Packham wrote:
> > > > > > > On 31/08/20 12:30 am, Nicholas Piggin wrote:
> > > > > > > > Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> > > > > > > <snip>
> > > > > > > 
> > > > > > > > > > > > > I've also now seen the RX FIFO not empty error on the T2080RDB
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > 
> > > > > > > > > > > > > With my current workaround of emptying the RX FIFO. It seems
> > > > > > > > > > > > > survivable. Interestingly it only ever seems to be 1 extra byte in the
> > > > > > > > > > > > > RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 70
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > 
> > > > > > > > > > > > >      From all the Micron SPI-NOR datasheets I've got access to it is
> > > > > > > > > > > > > possible to continually read the SR/FSR. But I've no idea why it
> > > > > > > > > > > > > happens some times and not others.
> > > > > > > > > > > > So I think I've got a reproduction and I think I've bisected the problem
> > > > > > > > > > > > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
> > > > > > > > > > > > C"). My day is just finishing now so I haven't applied too much scrutiny
> > > > > > > > > > > > to this result. Given the various rabbit holes I've been down on this
> > > > > > > > > > > > issue already I'd take this information with a good degree of skepticism.
> > > > > > > > > > > > 
> > > > > > > > > > > OK, so an easy test should be to re-test with a 5.4 kernel.
> > > > > > > > > > > It doesn't have yet the change you're referring to, and the fsl-espi driver
> > > > > > > > > > > is basically the same as in 5.7 (just two small changes in 5.7).
> > > > > > > > > > There's 6cc0c16d82f88 and maybe also other interrupt related patches
> > > > > > > > > > around this time that could affect book E, so it's good if that exact
> > > > > > > > > > patch is confirmed.
> > > > > > > > > My confirmation is basically that I can induce the issue in a 5.4 kernel
> > > > > > > > > by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
> > > > > > > > > 5.9-rc2 by reverting that one commit.
> > > > > > > > > 
> > > > > > > > > I both cases it's not exactly a clean cherry-pick/revert so I also
> > > > > > > > > confirmed the bisection result by building at 3282a3da25bd (which sees
> > > > > > > > > the issue) and the commit just before (which does not).
> > > > > > > > Thanks for testing, that confirms it well.
> > > > > > > > 
> > > > > > > > [snip patch]
> > > > > > > > 
> > > > > > > > > I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
> > > > > > > > > didn't report anything (either with or without the change above).
> > > > > > > > Okay, it was a bit of a shot in the dark. I still can't see what
> > > > > > > > else has changed.
> > > > > > > > 
> > > > > > > > What would cause this, a lost interrupt? A spurious interrupt? Or
> > > > > > > > higher interrupt latency?
> > > > > > > > 
> > > > > > > > I don't think the patch should cause significantly worse latency,
> > > > > > > > (it's supposed to be a bit better if anything because it doesn't set
> > > > > > > > up the full interrupt frame). But it's possible.
> > > > > > > My working theory is that the SPI_DON indication is all about the TX
> > > > > > > direction an now that the interrupts are faster we're hitting an error
> > > > > > > because there is still RX activity going on. Heiner disagrees with my
> > > > > > > interpretation of the SPI_DON indication and the fact that it doesn't
> > > > > > > happen every time does throw doubt on it.
> > > > > > > 
> > > > > > It's right that the eSPI spec can be interpreted that SPI_DON refers to
> > > > > > TX only. However this wouldn't really make sense, because also for RX
> > > > > > we program the frame length, and therefore want to be notified once the
> > > > > > full frame was received. Also practical experience shows that SPI_DON
> > > > > > is set also after RX-only transfers.
> > > > > > Typical SPI NOR use case is that you write read command + start address,
> > > > > > followed by a longer read. If the TX-only interpretation would be right,
> > > > > > we'd always end up with SPI_DON not being set.
> > > > > > 
> > > > > > > I can't really explain the extra RX byte in the fifo. We know how many
> > > > > > > bytes to expect and we pull that many from the fifo so it's not as if
> > > > > > > we're missing an interrupt causing us to skip the last byte. I've been
> > > > > > > looking for some kind of off-by-one calculation but again if it were
> > > > > > > something like that it'd happen all the time.
> > > > > > > 
> > > > > > Maybe it helps to know what value this extra byte in the FIFO has. Is it:
> > > > > > - a duplicate of the last read byte
> > > > > > - or the next byte (at <end address> + 1)
> > > > > > - or a fixed value, e.g. always 0x00 or 0xff
> > > > > The values were up thread a bit but I'll repeat them here
> > > > > 
> > > > > fsl_espi ffe110000.spi: tx 70
> > > > > fsl_espi ffe110000.spi: rx 03
> > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > 
> > > > > 
> > > > > The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
> > > > > either a READ_SR or a READ_FSR. Never a data read.
> > > > > 
> > > > Just remembered something about SPIE_DON:
> > > > Transfers are always full duplex, therefore in case of a read the chip
> > > > sends dummy zero's. Having said that in case of a read SPIE_DON means
> > > > that the last dummy zero was shifted out.
> > > > 
> > > > READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> > > > So the issue may have a dependency on the length of the transfer.
> > > > However I see no good explanation so far. You can try adding a delay of
> > > > a few miroseconds between the following to commands in fsl_espi_bufs().
> > > > 
> > > >     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
> > > > 
> > > >     /* Prevent filling the fifo from getting interrupted */
> > > >     spin_lock_irq(&espi->lock);
> > > > 
> > > > Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> > > I think this might be heading in the right direction. Playing about with
> > > a delay does seem to make the two symptoms less likely. Although I have
> > > to set it quite high (i.e. msleep(100)) to completely avoid any
> > > possibility of seeing either message.
> > The patch might replay the interrupt a little bit faster, but it would
> > be a few microseconds at most I think (just from improved code).
> > 
> > Would you be able to ftrace the interrupt handler function and see if you
> > can see a difference in number or timing of interrupts? I'm at a bit of
> > a loss.
> 
> I tried ftrace but I really wasn't sure what I was looking for.
> Capturing a "bad" case was pretty tricky. But I think I've identified a
> fix (I'll send it as a proper patch shortly). The gist is
> 
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> *espi, u32 events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>          struct fsl_espi *espi = context_data;
> -       u32 events;
> +       u32 events, mask;
> 
>          spin_lock(&espi->lock);
> 
>          /* Get interrupt events(tx/rx) */
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> -       if (!events) {
> +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> +       if (!(events & mask)) {
>                  spin_unlock(&espi->lock);
>                  return IRQ_NONE;
>          }
> 
> The SPIE register contains the TXCNT so events is pretty much always
> going to have something set. By checking events against what we've
> actually requested interrupts for we don't see any spurious events.
> 
> I've tested this on the T2080RDB and on our custom hardware and it seems
> to resolve the problem.
> 

I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
are after processing TX/RX. That looks a bit odd to me.

  Jocke