lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 26 Aug 2020 06:07:18 +0000
From:   Chris Packham <Chris.Packham@...iedtelesis.co.nz>
To:     Heiner Kallweit <hkallweit1@...il.com>,
        "broonie@...nel.org" <broonie@...nel.org>,
        "mpe@...erman.id.au" <mpe@...erman.id.au>,
        "benh@...nel.crashing.org" <benh@...nel.crashing.org>,
        "paulus@...ba.org" <paulus@...ba.org>
CC:     "linux-spi@...r.kernel.org" <linux-spi@...r.kernel.org>,
        "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15


On 26/08/20 1:48 pm, Chris Packham wrote:
>
> On 26/08/20 10:22 am, Chris Packham wrote:
>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>
>> <snip>
>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>> identified a couple of deficiencies that may or may not be related 
>>>> to my
>>>> issue.
>>>>
>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>> can be
>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>> register.
>>>> We also write back to it to clear the current events. We re-read it in
>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>> naturally end up in that situation if we're doing a large read. 
>>>> Consider
>>>> the messages for reading a block of data from a spi-nor chip
>>>>
>>>>    tx = READ_OP + ADDR
>>>>    rx = data
>>>>
>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>> goes
>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>> continues until we've received all the data and we finish with 
>>>> ESPI_SPIE
>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>> isn't set.
>>>>
>>>> The other deficiency is that we only get an interrupt when the 
>>>> amount of
>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>> the fifo.
>>>>
>>> SPIM_DON will trigger an interrupt once the last characters have been
>>> transferred, and read the remaining characters from the FIFO.
>>
>> The T2080RM that I have says the following about the DON bit
>>
>> "Last character was transmitted. The last character was transmitted 
>> and a new command can be written for the next frame."
>>
>> That does at least seem to fit with my assertion that it's all about 
>> the TX direction. But the fact that it doesn't happen all the time 
>> throws some doubt on it.
>>
>>> I think the reason I'm seeing some variability is because of how fast
>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>> chip can
>>>> fill the CPUs rx fifo.
>>>>
>>> To rule out timing issues at high bus frequencies I initially asked
>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>> or even less, then timing shouldn't be an issue.
>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>> would also expect a slower frequency would fit my "DON is for TX" 
>> narrative.
>>> Last relevant functional changes have been done almost 4 years ago.
>>> And yours is the first such report I see. So question is what could 
>>> be so
>>> special with your setup that it seems you're the only one being 
>>> affected.
>>> The scenarios you describe are standard, therefore much more people
>>> should be affected in case of a driver bug.
>> Agreed. But even on my hardware (which may have a latent issue 
>> despite being in the field for going on 5 years) the issue only 
>> triggers under some fairly specific circumstances.
>>> You said that kernel config impacts how frequently the issue happens.
>>> Therefore question is what's the diff in kernel config, and how could
>>> the differences be related to SPI.
>>
>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>> impact but every time I found something that seemed to be having an 
>> impact I've been able to disprove it. I actually think its about how 
>> busy the system is which may or may not affect when we get round to 
>> processing the interrupts.
>>
>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>> occur on the T2080RDB.
>>
>> I've had to add the following to expose the environment as a mtd 
>> partition
>>
>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> index ff87e67c70da..fbf95fc1fd68 100644
>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> @@ -116,6 +116,15 @@ flash@0 {
>>                                 compatible = "micron,n25q512ax3", 
>> "jedec,spi-nor";
>>                                 reg = <0>;
>>                                 spi-max-frequency = <10000000>; /* 
>> input clock */
>> +
>> +                               partition@...oot {
>> +                                        reg = <0x00000000 0x00100000>;
>> +                                        label = "u-boot";
>> +                                };
>> +                                partition@...oot-env {
>> +                                        reg = <0x00100000 0x00010000>;
>> +                                        label = "u-boot-env";
>> +                                };
>>                         };
>>                 };
>>
>> And I'm using the following script to poke at the environment 
>> (warning if anyone does try this and the bug hits it can render your 
>> u-boot environment invalid).
>>
>> cat flash/fw_env_test.sh
>> #!/bin/sh
>>
>> generate_fw_env_config()
>> {
>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>> name ; do
>>      echo "$dev $size $erasesize $name"
>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>> $erasesize" >/flash/fw_env.config
>>   done
>> }
>>
>> cycles=10
>> [ $# -ge 1 ] && cycles=$1
>>
>> generate_fw_env_config
>>
>> fw_printenv -c /flash/fw_env.config
>>
>> dmesg -c >/dev/null
>> x=0
>> while [ $x -lt $cycles ]; do
>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>     dmesg -c | grep -q fsl_espi && break;
>>     let x=x+1
>> done
>>
>> echo "Ran $x cycles"
>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems 
> survivable. Interestingly it only ever seems to be 1 extra byte in the 
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to it is 
> possible to continually read the SR/FSR. But I've no idea why it 
> happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem 
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
C"). My day is just finishing now so I haven't applied too much scrutiny 
to this result. Given the various rabbit holes I've been down on this 
issue already I'd take this information with a good degree of skepticism.

Thanks,
Chris

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ