Message-ID: <fcae2d955b3f43af8d64f1aa50fbc685@AcuMS.aculab.com>
Date:   Thu, 17 Mar 2022 11:10:21 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Vignesh Raghavendra' <vigneshr@...com>,
        'Michael Walle' <michael@...le.cc>
CC:     Tudor Ambarus <tudor.ambarus@...rochip.com>,
        "p.yadav@...com" <p.yadav@...com>,
        "broonie@...nel.org" <broonie@...nel.org>,
        "miquel.raynal@...tlin.com" <miquel.raynal@...tlin.com>,
        "richard@....at" <richard@....at>,
        "linux-mtd@...ts.infradead.org" <linux-mtd@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-spi@...r.kernel.org" <linux-spi@...r.kernel.org>,
        "nicolas.ferre@...rochip.com" <nicolas.ferre@...rochip.com>
Subject: RE: [PATCH v2 0/6] spi-mem: Allow specifying the byte order in DTR
 mode
From: Vignesh Raghavendra
> Sent: 17 March 2022 10:24
...
> Modern OSPI/QSPI flash controllers provide an MMIO interface to read from
> flash, where DMA can pull data as though you are reading from on-chip RAM
So the cpu does an MMIO read cycle to the controller which doesn't
complete until (for the nibble-mode spi device I have):
1) Chipselect is asserted.
2) The 8-bit command has been clocked out.
3) The 32-bit address has been clocked out (8 clocks in nibbles).
4) A few (probably 4) extra delay clocks are added.
5) The data is read - 8 clocks for 32 bits in nibble mode.
6) Chipselect is removed.
Now you can do long sequential reads without all the red tape.
But a random read in nibble mode is about 30 clocks.
16-bit mode saves 6 clocks for the data and maybe 6 for the address?
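
A rough sketch of the clock arithmetic above, not taken from any driver.
The function name and parameters are made up for illustration. It assumes
the 8-bit command takes 8 clocks (one bit per clock), that address and
data move `bus_width_bits` per clock, and it ignores any chip-select
setup/hold time:

```python
def read_clocks(bus_width_bits, cmd_clocks=8, addr_bits=32,
                dummy_clocks=4, data_bits=32):
    """Clocks between chip-select assert and release for one random read.

    Hypothetical model: the command is assumed to cost a fixed 8 clocks,
    while address and data are transferred bus_width_bits per clock.
    """
    addr_clocks = addr_bits // bus_width_bits
    data_clocks = data_bits // bus_width_bits
    return cmd_clocks + addr_clocks + dummy_clocks + data_clocks


nibble = read_clocks(4)    # 8 + 8 + 4 + 8 = 28, i.e. "about 30"
wide = read_clocks(16)     # 8 + 2 + 4 + 2 = 16
print(nibble, wide, nibble - wide)
```

With these assumptions the 16-bit case saves 12 clocks in total, 6 on the
data and 6 on the address, but the command and delay clocks still dominate.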
The controller could do 'clever stuff' for sequential reads.
At a cost of slowing down random reads.
So even at 400MHz it isn't that fast.
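
To put a number on that, a small sketch (hypothetical helper, assuming one
transfer per clock and 4 bytes returned per random read) that turns the
"about 30 clocks" figure into a latency and a byte rate:

```python
def random_read_rate(clocks_per_read, clock_hz, bytes_per_read=4):
    """Return (seconds per read, bytes per second) for back-to-back
    random reads, ignoring any gap between chip-select cycles."""
    seconds = clocks_per_read / clock_hz
    return seconds, bytes_per_read / seconds


latency_s, bytes_per_s = random_read_rate(30, 400e6)
print(f"{latency_s * 1e9:.0f} ns per read, {bytes_per_s / 1e6:.1f} MB/s")
```

That works out to about 75 ns per 32-bit read, or roughly 53 MB/s, before
any bus latency between the cpu and the controller is added.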
If the MMIO interface to the flash controller is PCIe you can
add in a load of extra latency for the cpu read itself.
While PCIe allows multiple read requests to be outstanding,
the Intel cpus I've looked at serialise the reads from each
cpu core (each cpu always uses the same TLP tag).
Longer read TLPs help a lot (IIRC max is 256 bytes).
But an x86 cpu will only generate a read TLP the size of the register being read.
You need to use AVX512 registers (or cache line fetches) to
get better throughput!
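
A sketch of why the read size matters when reads are serialised. With only
one read outstanding per core, throughput is just bytes per read divided
by the round-trip time. The 1 us round trip below is a placeholder picked
for easy arithmetic, not a measurement, and the helper name is made up:

```python
def serialised_read_throughput(bytes_per_read, round_trip_s):
    """Bytes per second when each read must complete before the next
    one is issued (one outstanding read TLP at a time)."""
    return bytes_per_read / round_trip_s


ROUND_TRIP_S = 1e-6  # hypothetical PCIe read round trip, NOT measured

for name, size in [("32-bit register", 4),
                   ("64-bit register", 8),
                   ("AVX512 register / cache line", 64)]:
    rate = serialised_read_throughput(size, ROUND_TRIP_S)
    print(f"{name}: {rate / 1e6:.0f} MB/s")
```

Whatever the real latency is, the ratio holds: a 64-byte read moves 16
times as much data per round trip as a 32-bit one.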
The alternative is getting the flash controller to issue
the read/write TLP for memory transfers.
	David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)