netdev - net: ethernet: freescale: fec: copybreak handling throughput, dma_sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAMeyCbjiVXFkzA5ZyJ5b3N4fotWkzKHVp3J=nT1yWs1v8dmRXA@mail.gmail.com>
Date:   Wed, 1 Jul 2020 20:44:43 +0200
From:   Kegl Rohit <keglrohit@...il.com>
To:     netdev@...r.kernel.org
Cc:     fugang.duan@....com
Subject: net: ethernet: freescale: fec: copybreak handling throughput,
 dma_sync_* optimisations allowed?

We are running a imx6q system using the FEC network interface driver.
Our goal is to process many small multicast raw frames with ~128 bytes
each at a rate of 4kHz. Therefore the driver must be able to cope with
receiving small packet bursts.

Tests with 6*128 byte brusts @ 4kHz interval (~24MBit/s) caused frame
drops from time to time as seen in IEEE_rx_macerr stat increasing.
Therefore the small FEC hardware FIFO overruns.
When adding additional system load via stress -m 4 --vm-bytes 50M the
situation gets worse. This additional load slows down the whole
system, also the NAPI processing of the FEC driver.

Because of this behaviour I created a flamegraph to get an idea of
time spent in the processing pipeline.
Attached is a flame graph with no additional system stress and one
without. I could observe that most time is spent in dma_sync_*() calls
during rx descriptor processing.

Commit https://github.com/torvalds/linux/commit/1b7bde6d659d30f171259cc2dfba8e5dab34e735
introduced copybreak to increase performance. In our test copybreak is
active the whole time because the 128 bytes packet size is below the
256 default limit.
As I can see from the commit the goal is to allocate a new skb with
the minimum size needed, memcpy the contents to it and pass the new
skb up the network stack. This allows it to keep the larger rx
descriptor buffer (~2KB) DMA mapped without allocation of a new
FEC_ENET_RX_FRSIZE (2KB) buffer each time.

fec_enet_copybreak(u32 length, ...) uses

dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
if (!swap)
   memcpy(new_skb->data, (*skb)->data, length);

to sync the descriptor data buffer and memcpy the data to the new skb
without calling dma_unmap().
Later in fec_enet_rx_queue() the dma descriptor buffer is synced again
in the opposite direction.

if (is_copybreak) {
  dma_sync_single_for_device(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),  FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
}

Now the two main questions:
1. Is it necessary to call dma_sync_single_for_cpu for the whole
buffer size (FEC_ENET_RX_FRSIZE - fep->rx_align), wouldn't syncing the
real packet length which is accessed by memcpy be enough?
Like so: dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), (u32) length, DMA_FROM_DEVICE);

2. Is dma_sync_single_for_device even necessary? There is no data
passed back to the device because the skb descriptor buffer is not
modified and the fec peripheral does not need any valid data.
The example in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
states:
 /* CPU should not write to
  * DMA_FROM_DEVICE-mapped area,
  * so dma_sync_single_for_device() is
  * not needed here. It would be required
  * for DMA_BIDIRECTIONAL mapping if
  * the memory was modified.
 */
I am new to the DMA API on ARM. Are these changes regarding cache
flushing,... allowed? These would increase the copybreak throughput by
reducing CPU load.

Download attachment "copybreak_memorystress.png" of type "image/png" (54512 bytes)

Download attachment "copybreak_nostress.png" of type "image/png" (55569 bytes)