lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+h21hr4UcoJK7upNJjG0ibtX7CkF=akxVdrb--1AJn6-z=sUQ@mail.gmail.com>
Date:   Tue, 20 Aug 2019 18:57:10 +0300
From:   Vladimir Oltean <olteanv@...il.com>
To:     Mark Brown <broonie@...nel.org>,
        Hubert Feurstein <h.feurstein@...il.com>, mlichvar@...hat.com,
        Richard Cochran <richardcochran@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>
Cc:     linux-spi@...r.kernel.org, netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver

On Sun, 18 Aug 2019 at 21:26, Vladimir Oltean <olteanv@...il.com> wrote:
>
> This patchset proposes an interface from the SPI subsystem for
> software timestamping SPI transfers. There is a default implementation
> provided in the core, as well as a mechanism for SPI slave drivers to
> check which byte was in fact timestamped post-facto. The patchset also
> adds the first user of this interface (the NXP DSPI driver in TCFQ mode).
>
> The interface is somewhat similar to Hubert Feurstein's proposal for the
> MDIO subsystem: https://lkml.org/lkml/2019/8/16/638
>
> Original cover letter below. Also provided at the end some results with
> an extra test (J - phc2sys using the timestamps taken by the SPI core).
>
> ===========================================================
>
> Continuing the discussion created by Hubert Feurstein around the
> mv88e6xxx driver for MDIO-controlled switches
> (https://lkml.org/lkml/2019/8/2/1364), this patchset takes a similar
> approach for the NXP LS1021A-TSN board, which has a SPI-controlled DSA
> switch (SJA1105).
>
> The patchset is motivated by some experiments done with a logic
> analyzer, trying to understand the source of latency (and especially of
> the jitter). SJA1105 SPI messages for reading the PTP clock are 12 bytes
> in length: 4 for the SPI header and 8 for the timestamp. When looking at
> the messages with a scope, there's jitter basically everywhere: between
> bits of a frame and between frames in a transfer. The inter-bit jitter
> is hardware and impacts us to a lesser extend (is smaller and caused by
> the PVT stability of the oscillators, PLLs, etc). We will focus on the
> latency between consecutive SPI frames within a 12-byte transfer.
>
> As a preface, revisions of the DSPI controller IP are integrated in many
> Freescale/NXP devices. As a result, the driver has 3 modes of operation:
> - TCFQ (Transfer Complete Flag mode): The controller signals software
>   that data has been sent/received after each individual word.
> - EOQ (End of Queue mode): The driver can implement batching by making
>   use of the controller's 4-word deep FIFO.
> - DMA (Direct Memory Access mode): The SPI controller's FIFO is no
>   longer in direct interaction with the driver, but is used to trigger
>   the RX and TX channels of the eDMA module on the SoC.
>
> In LS1021A, the driver works in the least efficient mode of the 3
> (TCFQ). There is a well-known errata that the DSPI controller is broken
> in conjunction with the eDMA module. As for the EOQ mode, I have tried
> unsuccessfully for a few days to make use of the 4 entry FIFO, and the
> hardware simply fails to reliably acknowledge the transmission when the
> FIFO gets full. So it looks like we're stuck with the TCFQ mode.
>
> The problem with phc2sys on the LS1021A-TSN board is that in order for
> the gettime64() call to complete on the sja1105, the system has to
> service 12 IRQs. Intuitively that is excessive and is the main source of
> jitter, but let's not get ahead of ourselves.
>
> An outline of the experiments that were done (unless otherwise
> mentioned, all of these ran for 120 seconds):
>
> A. First I have measured the (poor) performance of phc2sys under current
>    conditions. (DSPI driver in IRQ mode, no PTP system timestamping)
>
>    offset: min -53310 max 16107 mean -1737.18 std dev 11444.3
>    delay: min 163680 max 237360 mean 201149 std dev 22446.6
>    lost servo lock 1 times
>
> B. I switched the .gettime64 callback to .gettimex64, snapshotting the
>    PTP system timestamp within the sja1105 driver.
>
>    offset: min -48923 max 64217 mean -904.137 std dev 17358.1
>    delay: min 149600 max 203840 mean 169045 std dev 17993.3
>    lost servo lock 8 times
>
> C. I patched "struct spi_transfer" to contain the PTP system timestamp,
>    and from the sja1105 driver, I passed this structure to be
>    snapshotted by the SPI controller's driver (spi-fsl-dspi). This is
>    the "transfer-level" snapshot.
>
>    offset: min -64979 max 38979 mean -416.197 std dev 15367.9
>    delay: min 125120 max 168320 mean 150286 std dev 17675.3
>    lost servo lock 10 times
>
> D. I changed the placement of the transfer snapshotting within the DSPI
>    driver, from "transfer-level" to "byte-level".
>
>    offset: min -9021 max 7149 mean -0.418803 std dev 3529.81
>    delay: min 7840 max 23920 mean 14493.7 std dev 5982.17
>    lost servo lock 0 times
>
> E. I moved the DSPI driver to poll mode. I went back to collecting the
>    PTP system timestamps from the sja1105 driver (same as B).
>
>    offset: min -4199 max 46643 mean 418.214 std dev 4554.01
>    delay: min 84000 max 194000 mean 99463.2 std dev 12936.5
>    lost servo lock 1 times
>
> F. Transfer-level snapshotting in the DSPI driver (same as C), but in
>    poll mode.
>
>    offset: min -24244 max 1115 mean -230.478 std dev 2297.28
>    delay: min 69440 max 119040 mean 70312.9 std dev 8065.34
>    lost servo lock 1 times
>
> G. Byte-level snapshotting (same as D) but in poll mode.
>
>    offset: min -314 max 288 mean -2.48718 std dev 118.045
>    delay: min 4880 max 6000 mean 5118.63 std dev 507.258
>    lost servo lock 0 times
>
>    This seemed suspiciously good to me, so I let it run for longer
>    (58 minutes):
>
>    offset: min -26251 max 16416 mean -21.8672 std dev 863.416
>    delay: min 4720 max 57280 mean 5182.49 std dev 1607.19
>    lost servo lock 3 times
>
> H. Transfer-level snapshotting (same as F), but with IRQs disabled.
>    This ran for 86 minutes.
>
>    offset: min -1927 max 1843 mean -0.209203 std dev 529.398
>    delay: min 85440 max 93680 mean 88245 std dev 1454.71
>    lost servo lock 0 times
>
> I. Byte-level snapshotting (same as G), but with IRQs disabled.
>    This ran for 102 minutes.
>
>    offset: min -378 max 381 mean -0.0083089 std dev 101.495
>    delay: min 4720 max 5920 mean 5129.38 std dev 154.899
>    lost servo lock 0 times
>
> J. Default snapshotting taken by the SPI core, with the DSPI driver
>    running in poll mode, IRQs enabled. This ran for 274 minutes.
>
>    offset: min -42568 max 44576 mean 2.91646 std dev 947.467
>    delay: min 58480 max 171040 mean 80750.7 std dev 2001.61
>    lost servo lock 3 times
>
> As a result, this patchset proposes the implementation of scenario I.
> The others were done through temporary patches which are not presented
> here due to the difficulty of presenting a coherent git history without
> resorting to reverts etc. The gist of each experiment should be clear
> though.
>
> The raw data is available for dissection at
> https://drive.google.com/open?id=1r9raU9ZeqOqkqts6Lb-ISf5ubLDLP3wk.
> The logic analyzer captures can be opened with a free-as-in-beer program
> provided by Saleae: https://www.saleae.com/downloads/.
>
> In the capture data one can find the MOSI, SCK SPI signals, as well as a
> debug GPIO which was toggled at the same time as the PTP system
> timestamp was taken, to give the viewer an impression of what the
> software is capturing compared to the actual timing of the SPI transfer.
>
> Attached are also some close-up screenshots of transfers where there is
> a clear and huge delay in-between frames of the same 12-byte SPI
> transfer. As it turns out, these were all caused by the CPU getting
> interrupted by some other IRQ. Approaches H and I are the only ones that
> get rid of these glitches. In theory, the byte-level snapshotting should
> be less vulnerable to an IRQ interrupting the SPI transfer (because the
> time window is much smaller) but as the 58 minutes experiment shows, it
> is not immune.
>
> Vladimir Oltean (5):
>   spi: Use an abbreviated pointer to ctlr->cur_msg in
>     __spi_pump_messages
>   spi: Add a PTP system timestamp to the transfer structure
>   spi: spi-fsl-dspi: Use poll mode in case the platform IRQ is missing
>   spi: spi-fsl-dspi: Implement the PTP system timestamping for TCFQ mode
>   spi: spi-fsl-dspi: Disable interrupts and preemption during poll mode
>     transfer
>
>  drivers/spi/spi-fsl-dspi.c | 117 +++++++++++++++++++++++++++++++------
>  drivers/spi/spi.c          |  85 +++++++++++++++++++++++----
>  include/linux/spi/spi.h    |  38 ++++++++++++
>  3 files changed, 210 insertions(+), 30 deletions(-)
>
> --
> 2.17.1
>

To the PTP/DSA people copied,

It's possible that I'm going to get a lot of hate for saying this, but
I think we're all missing the forest for the trees with these
ptp_system_timestamp patches.
I'm going to start off with 4 truisms:
- The best software timestamp is worse than a hardware timestamp
- DSA switches are switches that are connected to their host over Ethernet
- Ethernet has support for hardware timestamping
- The mess of taking precise hardware timestamps is well hidden from the kernel

You might see where I'm going.
Maybe this is all really a DSA-specific problem, and should be treated
as such (kept contained).
The claimed goal is to synchronize the DSA switches' time with phc2sys
to/from something else. The problem is that there is latency for
reading the PHC on a DSA device that is a discrete chip.
What if all we need is just a mini-"phc2sys-over-Ethernet" that runs
on a kernel thread internally to DSA? We say that DSA switches are
"slave" to the "master" netdevice - their PTP clock can be the same.
I am fairly confident that the sja1105 at least can be configured in
hardware to work in this mode. One just needs to enable the CPU port
in its own reachability matrix. None of the switch ports is really a
"CPU port" hardware speaking.
- TX timestamps are taken by installing a management route with a
specified port destination. That destination can be the port the frame
came from (the CPU port) if the above condition is met.
- RX timestamps are taken by the switch for frames matching one of 2
MAC filters. Then a short Ethernet frame containing metadata (RX
timestamp) is created and sent to the CPU port. If I enable RX
timestamping on the CPU port, then every management frame sent from
Linux will also generate an RX timestamp (as well as a TX timestamp,
but that is irrelevant).
I believe something similar should be possible on other hardware as well.

The kernel thread can loop back an Ethernet frame and use the 4
collected timestamps to calculate offset and delay.
The only question is how to manage the sync direction (DSA switch to
master, or vice-versa).
It is assumed that the master netdevice supports hardware timestamping
and has a PHC with lower access jitter. That might be a more common
thing than an MDIO or SPI controller with polled I/O and software
timestamping implemented.

Looking forward to comments. If I'm wrong and we do need to extend the
SPI and MDIO subsystems, then I'd better be wrong in any way we look
at the problem, because the alternative is rather intrusive and
tedious to do right (not to mention, not very reusable).

Thanks,
-Vladimir

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ