linux-kernel - Re: fsl_espi errors on v5.7.15

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <88a92234-450b-f559-d112-31d5c996c256@alliedtelesis.co.nz>
Date:   Tue, 1 Sep 2020 02:52:37 +0000
From:   Chris Packham <Chris.Packham@...iedtelesis.co.nz>
To:     Heiner Kallweit <hkallweit1@...il.com>,
        Nicholas Piggin <npiggin@...il.com>,
        "benh@...nel.crashing.org" <benh@...nel.crashing.org>,
        "broonie@...nel.org" <broonie@...nel.org>,
        "mpe@...erman.id.au" <mpe@...erman.id.au>,
        "paulus@...ba.org" <paulus@...ba.org>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "linux-spi@...r.kernel.org" <linux-spi@...r.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15


On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>> <snip>
>>>>
>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>
>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>
>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>> happens some times and not others.
>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>
>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>> patch is confirmed.
>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>
>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>> the issue) and the commit just before (which does not).
>>>>> Thanks for testing, that confirms it well.
>>>>>
>>>>> [snip patch]
>>>>>
>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>> didn't report anything (either with or without the change above).
>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>> else has changed.
>>>>>
>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>> higher interrupt latency?
>>>>>
>>>>> I don't think the patch should cause significantly worse latency,
>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>> up the full interrupt frame). But it's possible.
>>>> My working theory is that the SPI_DON indication is all about the TX
>>>> direction an now that the interrupts are faster we're hitting an error
>>>> because there is still RX activity going on. Heiner disagrees with my
>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>> happen every time does throw doubt on it.
>>>>
>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>> looking for some kind of off-by-one calculation but again if it were
>>>> something like that it'd happen all the time.
>>>>
>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>> - a duplicate of the last read byte
>>> - or the next byte (at <end address> + 1)
>>> - or a fixed value, e.g. always 0x00 or 0xff
>> The values were up thread a bit but I'll repeat them here
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>>
>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>> either a READ_SR or a READ_FSR. Never a data read.
>>
> Just remembered something about SPIE_DON:
> Transfers are always full duplex,

That's not true in rxskip mode. Setting rxskip forces the eSPI into 
half-duplex mode. So far all the instances of "extra" rx bytes have been 
when rxskip == 1.

I think that could be where our experience with SPIE_DON differ. In 
full-duplex mode yes DON is always set. In half duplex mode we can end 
up with DON set or not set depending on the interrupt timing.

> therefore in case of a read the chip
> sends dummy zero's. Having said that in case of a read SPIE_DON means
> that the last dummy zero was shifted out.
>
> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> So the issue may have a dependency on the length of the transfer.
> However I see no good explanation so far. You can try adding a delay of
> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> 	/* Prevent filling the fifo from getting interrupted */
> 	spin_lock_irq(&espi->lock);
>
> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.