netdev - Re: [net-next 6/6] can: mcp251xfd: mcp251xfd_regmap_crc_read(): work around broken CRC on TBC register

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210510074334.el2yxp3oy2pmbs7d@pengutronix.de>
Date:   Mon, 10 May 2021 09:43:34 +0200
From:   Marc Kleine-Budde <mkl@...gutronix.de>
To:     Patrick Menschel <menschel.p@...teo.de>
Cc:     Drew Fustini <drew@...gleboard.org>, netdev@...r.kernel.org,
        linux-can@...r.kernel.org, Will C <will@...china.cc>
Subject: Re: [net-next 6/6] can: mcp251xfd: mcp251xfd_regmap_crc_read(): work
 around broken CRC on TBC register

On 08.05.2021 18:36:56, Patrick Menschel wrote:
> ### Test conditions ###
> 
> Since I lacked a true stress test, I wrote one for regular tox with
> pytest collection.
> 
> https://gitlab.com/Menschel/socketcan/-/blob/master/tests/test_socketcan.py#L872
> 
> It uses mcp0 and mcp1 which are directly connected.
> No CAN FD, just 500k with regular frames, random id and random data.
> 
> I basically mimic cangen but enhanced with a queue that handles to the
> rx thread what should be compared next.
> 
> ### Extract from dmesg shows no CRC Errors ###
> 
> [   30.930608] CAN device driver interface
> [   30.967349] spi_master spi0: will run message pump with realtime priority
> [   31.054202] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT
> -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
> r:17.00MHz e:16.66MHz) successfully initialized.
> [   31.076906] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT
> -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz
> r:17.00MHz e:16.66MHz) successfully initialized.
> [   31.298969] mcp251xfd spi0.0 mcp0: renamed from can1
> [   31.339864] mcp251xfd spi0.1 mcp1: renamed from can0
> [   33.471889] IPv6: ADDRCONF(NETDEV_CHANGE): mcp0: link becomes ready
> [   34.482260] IPv6: ADDRCONF(NETDEV_CHANGE): mcp1: link becomes ready
> [  215.218979] can: controller area network core
> [  215.219146] NET: Registered protocol family 29
> [  215.261599] can: raw protocol
> [  218.745376] can: isotp protocol
> [  220.931150] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931274] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931395] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931518] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931643] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931768] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  220.931893] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  222.099822] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  222.099901] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  222.100022] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [  222.330438] can: broadcast manager protocol
> 
> That softirq error has something to do with IsoTp. I was not able to
> trace it back but I have it on multiple boards: pi0w, pi3b, pi3b+.

The softirq error is known and shows up as the mcp251xfd driver raises a
softirq from threaded IRQ context. We're working on fixing this.

> ### Performance ###
> 
> ## v5.10-rpi/backport-performance-improvements ##
> 
> I get about 20000 frames in 2 minutes.
> 
> 2021-05-08 19:00:36 [    INFO] 20336 frames in 0:02:00
> (test_socketcan.py:890)
> 
> 2021-05-08 19:49:34 [    INFO] 20001 frames in 0:02:00
> (test_socketcan.py:890)
> 
> 
> ## regular v5.10 ##
> 
> 2021-05-08 20:19:55 [    INFO] 20000 frames in 0:02:00
> (test_socketcan.py:890)
> 
> 2021-05-08 20:22:40 [    INFO] 19995 frames in 0:02:00
> (test_socketcan.py:890)
> 
> 2021-05-08 20:25:22 [    INFO] 19931 frames in 0:02:00
> (test_socketcan.py:890)
> 
> 
> The numbers are slightly better but I count that as tolerance.

Makes sense. But you have only measured number of frames in a given
time. The raspi SPI driver is highly optimized so the changes in the
driver don't show up in those numbers.

Thanks for testing, I'll send a pull request to the raspi kernel.

If you are interested if there are performance benefits on your raspi,
consider measuring the spent CPU time and the number of SPI interrupts.

Measure CPU time by putting the command "time" in front of your test.
Measure SPI Interrupts by looking at /proc/interrupts before and after
the test. Note: there are SPI host controller interrupts and Interrupts
from the mcp251xfd.

On a raspi you probably only have a hand full of SPI host controller
interrupts, as the raspi driver only uses interrupts for long transfers.
There will be a mcp251xfd interrupt per TX-complete and RX CAN message,
maybe a few less if they overlap.

> I also found that there are cross effects. If I run the same test on
> vcan0 before, the frame count goes down to 13000 instead.

The changes only touch the mcp251xfd driver, if you see a difference
with the vcan driver, it's either a change in the kernel somewhere else
or your test setup is sensitive to something you changed without
noticing (starting condition, ...)

> I also have to admit, that I didn't get any crc errors with regular
> v5.10 during that tests.

The CRC errors the patch works around are CRC errors introduced by a
chip erratum, not by electromagnetic interference. In my observation
these CRC errors show up if the register contents changes while the
register is read. The register that changes most is the timer base
counter register. That register is only read if a CAN bus error is
signaled to user space (and this is maximized by enabling bus error
reporting). If it happens to be a CRC error while reading the TBC
register and the CRC can be "corrected" by flipping the upper most bit,
there will be no error message about any CRC errors.

Long story short. You only notice that this patch works, if in a
situation you had CRC errors on the TBC register (that is CAN errors are
reported to user space), you now have an order of magnitude less CRC
errors than before.

> Do I have to change my test?

No need to.

> I can still update that pi3b+ that runs my micro-hil at work. That was
> the one that occasionally had CRC errors.

Thanks again for testing!

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)