[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<VI2PR04MB111476CEA8FFF6F79326FA20AE8DEA@VI2PR04MB11147.eurprd04.prod.outlook.com>
Date: Wed, 26 Nov 2025 09:51:48 +0000
From: Carlos Song <carlos.song@....com>
To: Marc Kleine-Budde <mkl@...gutronix.de>
CC: "broonie@...nel.org" <broonie@...nel.org>, Frank Li <frank.li@....com>,
"hawnguo@...nel.org" <hawnguo@...nel.org>, "s.hauer@...gutronix.de"
<s.hauer@...gutronix.de>, "kernel@...gutronix.de" <kernel@...gutronix.de>,
"festevam@...il.com" <festevam@...il.com>, "imx@...ts.linux.dev"
<imx@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "linux-spi@...r.kernel.org"
<linux-spi@...r.kernel.org>
Subject: RE: [EXT] Re: [PATCH 5/6] spi: imx: support dynamic burst length for
ECSPI DMA mode
> -----Original Message-----
> From: Marc Kleine-Budde <mkl@...gutronix.de>
> Sent: Wednesday, November 26, 2025 4:31 PM
> To: Carlos Song <carlos.song@....com>
> Cc: broonie@...nel.org; Frank Li <frank.li@....com>; hawnguo@...nel.org;
> s.hauer@...gutronix.de; kernel@...gutronix.de; festevam@...il.com;
> imx@...ts.linux.dev; linux-kernel@...r.kernel.org;
> linux-arm-kernel@...ts.infradead.org; linux-spi@...r.kernel.org
> Subject: [EXT] Re: [PATCH 5/6] spi: imx: support dynamic burst length for ECSPI
> DMA mode
>
> On 26.11.2025 07:42:36, Carlos Song wrote:
>
> [...]
>
> > > > + if (bytes_per_word == 1)
> > > > + swab32s(temp + i);
> > > > + else if (bytes_per_word == 2)
> > > > + swahw32s(temp + i);
> > >
> > > Why do you do first a swap in place and then a memcpy()
> > >
> > When dynamic burst enabled, DMA copy data always using buswidth=4 bytes.
> > CPU is little endian so bytes order actually little endian in dma_buf.
> > But for bytes_per_word=1, every bytes should be the same order with mem
> order, it should be big endian so swap every bytes.
> > In the same reason, bytes per word = 2, swap half word.
> > Bytes per word = 4 don't need do any thing.
> > (SPI is not ruled bytes order, so SPI bytes order always follow CPU
> > bytes order, here still follow)
>
> Thanks for the explanation. I think my question was not completely clear. I want
> to know why you touch every byte twice, first you do the swap on the existing
> buffer, then you do a memcpy(). You might do both operations in one go, i.e.
> read the bytes from the original buffer and write them swapped to the bounce
> buffer.
>
Hi, Marc
If CPU is big endian, we don't touch this bytes twice.
But CPU is little endian, we must touch this bytes twice. This is needed to make right bytes order.
Then if RX trim and tx bytes relocate are needed, we also can not combine 2 steps to 1 step.
> > > > + }
> > > > + }
> > > > +#endif
> > > > +
> > > > + if (dma_data->data_len % BYTES_PER_32BITS_WORD
> && !word_delay) {
> > >
> > > I think this deserves a comment, why you do a re-alignment of the data
> here.
> > >
> > Yes. I can add one comment here.
> >
> > In fact it is not re-alignment, it is trim data.
> > When dynamic burst enabled, DMA copy data always using bus width=4 bytes.
> > So DMA always get 4 bytes data from RXFIFO. But if data lens is not
> > 4-byte alignment, the data in the DMA bounce buffer contains extra garbage
> bytes, so it needs to be trimmed before memcopy to xfer buffer.
> >
> > Why is the first word?
> > It is from HW behavior. When dynamic burst enabled, BURST_LENGTH will
> > be set same with actual data len, It helps maintain correct bit count.
> >
> > As RM shows:
> > "
> > In master mode, it controls the number of bits per SPI burst. Since
> > the shift register always loads 32-bit data from transmit FIFO, only
> > the n least-significant (n = BURST LENGTH + 1) will be shifted out. The
> remaining bits will be ignored.
> >
> > Number of Valid Bits in a SPI burst.
> >
> > 0x000 A SPI burst contains the 1 LSB in a word.
> > 0x001 A SPI burst contains the 2 LSB in a word.
> > 0x002 A SPI burst contains the 3 LSB in a word.
> > ...
> > 0x01F A SPI burst contains all 32 bits in a word.
> > 0x020 A SPI burst contains the 1 LSB in first word and all 32 bits in second
> word.
> > 0x021 A SPI burst contains the 2 LSB in first word and all 32 bits in second
> word.
> > "
> > When data len is not 4 bytes-align, so the first word is always include some
> garbage bytes(if transfer 7 bytes. first word include one garbage byte and 3 valid
> bytes, four bytes in second word).
> >
> > > > + unaligned = dma_data->data_len % BYTES_PER_32BITS_WORD;
> > > > + copy_ptr = (u8 *)dma_data->dma_rx_buf +
> > > BYTES_PER_32BITS_WORD - unaligned;
> > > > + } else {
> > > > + copy_ptr = dma_data->dma_rx_buf;
> > >
> > > Why do you use the bounce buffer if the data is aligned correct?
> > >
> > Whatever data is 4 bytes align, when CPU is little endian, bytes swap should be
> done additionally according to different bytes per word setting.
>
> But for 32 bits per word or word delay you do a not needed bounce buffers and
> memcpy()?
>
Yes, awesome! you are totally right! I I've also considered this.
If do this like what you say, a signal DMA logic code should be create from DMA start map to transfer finished.
I should provide one DMA xfer function for bits_per_word=32 and word delay(no bounce buffer).
Then provide another DMA xfer function for bits_per_word=8/16 (with bounce buffer).
Actually it will not provide any previous improvement, but the drivers become complex with more codes.
I prefer maintain a set of simple and common code.
So I didn't do this.
> We still need the bounce buffer for length not multiple of 4, because a direct
> DMA write would overflow the destination buffer, right? And of course bounce
> buffers for LE 8 and 16 bit per word for the byte swapping.
>
Yes. totally right!
> > Summary whole solution about dynamic burst for DMA mode:
> > 1. Always read/write SPI FIFO with DMA buswidth = 4, so DMA bounce buffer
> always 4-bytes alignment:
> > swap bytes/half word according to bytes per word=8/16 when in CPU little
> endian.
> > 2. BURST_LENGTH setting is important, it helps maintain correct bit count(HW
> trim: don't shift out bits to TXFIFO also don't shift in bits in RXFIFO):
> > * TX: Although DMA put 4 byte-alignment data to FIFO, but in bounce
> buffer we put valid data in valid LSB of first word, it can make sure ECSPI only
> shift out valid data in bonus buffer.
> > * RX: Although DMA get 4bytes- alignment data from RXFIFO to bounce
> buffer, but trim it with valid LSB according actual xfer->len, it can make rx_buf is
> right data.
> >
> > > > + }
> > > > +
> > > > + memcpy(rx_buf, copy_ptr, dma_data->data_len); }
> > > > +
> > > > +static int spi_imx_dma_map(struct spi_imx_data *spi_imx,
> > > > + struct dma_data_package *dma_data) {
> > > > + struct spi_controller *controller = spi_imx->controller;
> > > > + struct device *tx_dev = controller->dma_tx->device->dev;
> > > > + struct device *rx_dev = controller->dma_rx->device->dev;
> > > > +
> > > > + dma_data->dma_tx_addr = dma_map_single(tx_dev,
> > > dma_data->dma_tx_buf,
> > > > +
> > > DMA_CACHE_ALIGNED_LEN(dma_data->dma_len),
> > > > + DMA_TO_DEVICE);
> > > > + if (dma_mapping_error(tx_dev, dma_data->dma_tx_addr)) {
> > > > + dev_err(spi_imx->dev, "DMA TX map failed\n");
> > > > + return -EINVAL;
> > >
> > > please propagate the error value of dma_mapping_error()
> > >
> >
> > Will do this in V2
> > > > + }
> > > > +
> > > > + dma_data->dma_rx_addr = dma_map_single(rx_dev,
> > > dma_data->dma_rx_buf,
> > > > +
> > > DMA_CACHE_ALIGNED_LEN(dma_data->dma_len),
> > > > + DMA_FROM_DEVICE);
> > > > + if (dma_mapping_error(rx_dev, dma_data->dma_rx_addr)) {
> > > > + dev_err(spi_imx->dev, "DMA RX map failed\n");
> > > > + goto rx_map_err;
> > >
> > > there's only one user of the dma_unmap_single(), so no need for the goto.
> > >
> > This goto is to unmap previous TX, not this RX. TX has been mapped
> > then start to map RX, now RX mapping error, Do we really don't need to
> rollback for TX?
>
> Sorry there was a misunderstanding, I mean you should directly call
> dma_unmap_single() and remove the goto.
>
> >
> > > > + }
> > > > +
> > > > + return 0;
> > > > +
> > > > +rx_map_err:
> > > > + dma_unmap_single(tx_dev, dma_data->dma_tx_addr,
> > > > + DMA_CACHE_ALIGNED_LEN(dma_data->dma_len),
> > > > + DMA_TO_DEVICE);
> > > > + return -EINVAL;
> > > > +}
> > > > +
> > > > +static int spi_imx_dma_tx_data_handle(struct spi_imx_data *spi_imx,
> > > > + struct dma_data_package *dma_data,
> > > > + const void *tx_buf,
> > > > + bool word_delay)
> > > > +{
> > > > +#ifdef __LITTLE_ENDIAN
> > > > + unsigned int bytes_per_word =
> > > spi_imx_bytes_per_word(spi_imx->bits_per_word);
> > > > + u32 *temp;
> > > > +#endif
> > >
> > > move into scope of if()
> > >
> > Will do this in V2.
> > > > + void *copy_ptr;
> > > > + int unaligned;
> > > > +
> > > > + if (word_delay) {
> > > > + dma_data->dma_len = dma_data->data_len;
> > > > + } else {
> > > > + /*
> > > > + * As per the reference manual, when burst length = 32*n + m
> > > > +bits,
> > > ECSPI
> > > > + * sends m LSB bits in the first word, followed by n full 32-bit
> words.
> > > > + * Since actual data may not be 4-byte aligned, allocate DMA
> > > > +TX/RX
> > > buffers
> > > > + * to ensure alignment. For TX, DMA pushes 4-byte aligned
> words
> > > > +to
> > > TXFIFO,
> > > > + * while ECSPI uses BURST_LENGTH settings to maintain correct
> > > > +bit
> > > count.
> > > > + * For RX, DMA receives 32-bit words from RXFIFO; after
> > > > +transfer
> > > completes,
> > > > + * trim the DMA RX buffer and copy the actual data to rx_buf.
> > > > + */
> > >
> > > Ahh, please add the corresponding description for rx.
> > >
> > Will do this in V2.
> > > > + dma_data->dma_len = ALIGN(dma_data->data_len,
> > > BYTES_PER_32BITS_WORD);
> > > > + }
> > > > +
> > > > + dma_data->dma_tx_buf = kmalloc(dma_data->dma_len,
> GFP_KERNEL |
> > > > +__GFP_ZERO);
> > >
> > > kzalloc()?
> > >
> > Yes. Will do this in V2
> > > > + if (!dma_data->dma_tx_buf)
> > > > + return -ENOMEM;
> > > > +
> > > > + dma_data->dma_rx_buf = kmalloc(dma_data->dma_len,
> GFP_KERNEL |
> > > __GFP_ZERO);
> > > > + if (!dma_data->dma_rx_buf) {
> > > > + kfree(dma_data->dma_tx_buf);
> > > > + return -ENOMEM;
> > > > + }
> > > > +
> > > > + if (dma_data->data_len % BYTES_PER_32BITS_WORD
> && !word_delay) {
> > > > + unaligned = dma_data->data_len % BYTES_PER_32BITS_WORD;
> > > > + copy_ptr = (u8 *)dma_data->dma_tx_buf +
> > > BYTES_PER_32BITS_WORD - unaligned;
> > > > + } else {
> > > > + copy_ptr = dma_data->dma_tx_buf;
> > > > + }
> > > > +
> > > > + memcpy(copy_ptr, tx_buf, dma_data->data_len);
> > > > +
> > > > + /*
> > > > + * When word_delay is enabled, DMA transfers an entire word in
> > > > +one
> > > minor loop.
> > > > + * In this case, no data requires additional handling.
> > > > + */
> > > > + if (word_delay)
> > > > + return 0;
> > > > +
> > > > +#ifdef __LITTLE_ENDIAN
> > > > + /*
> > > > + * On little-endian CPUs, adjust byte order:
> > > > + * - Swap bytes when bpw = 8
> > > > + * - Swap half-words when bpw = 16
> > > > + * This ensures correct data ordering for DMA transfers.
> > > > + */
> > > > + temp = dma_data->dma_tx_buf;
> > > > + for (int i = 0; i < DIV_ROUND_UP(dma_data->dma_len,
> > > BYTES_PER_32BITS_WORD); i++) {
> > > > + if (bytes_per_word == 1)
> > > > + swab32s(temp + i);
> > > > + else if (bytes_per_word == 2)
> > > > + swahw32s(temp + i);
> > > > + }
> > > > +#endif
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int spi_imx_dma_data_prepare(struct spi_imx_data *spi_imx,
> > > > + struct spi_transfer *transfer,
> > > > + bool word_delay)
> > > > +{
> > > > + u32 pre_bl, tail_bl;
> > > > + u32 ctrl;
> > > > + int ret;
> > > > +
> > > > + /*
> > > > + * ECSPI supports a maximum burst of 512 bytes. When xfer->len
> > > > +exceeds
> > > 512
> > > > + * and is not a multiple of 512, a tail transfer is required. In this case,
> > > > + * an extra DMA request is issued for the remaining data.
> > >
> > > Why do you need an extra transfer in this case?
> > >
> >
> > BURST_LEGTH is used for SPI HW to maintain correct bit count. So
> > BURST_LENGTH should update with data length. After DMA request submit,
> > SPI can not update the BURST_LENGTH, when needed, we must split two
> package, update the register then setup second DMA transfer.
> >
> > ECSPI HW can update BURST_LENGTH auto, but it always update this using
> previous value.
> > When len > 512 but len is not 512-unaligned, we need two packages, second
> for tail data.
> > For example len is 512 *3 + 511. So first transfer using BURST_LENGTH
> > = 512 bytes(auto update 3 times), DMA transfer len = 512 *3, second
> > package BURST_LENGTH = 511 bytes, DMA transfer len = 511.(If here we
> > use 512 bytes as BURST_LENGTH, SPI will shift out/in extra bits, it
> > previous isn't acceptable!)
>
> What happens if you keep the Burst Length at 512 and only transfer 511 bytes
> with the DMA engine?
>
> Marc
>
> --
> Pengutronix e.K. | Marc Kleine-Budde |
> Embedded Linux | https://www.pengutronix.de |
> Vertretung Nürnberg | Phone: +49-5121-206917-129 |
> Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |
Powered by blists - more mailing lists