lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 1 Jul 2020 20:44:43 +0200
From:   Kegl Rohit <keglrohit@...il.com>
To:     netdev@...r.kernel.org
Cc:     fugang.duan@....com
Subject: net: ethernet: freescale: fec: copybreak handling throughput,
 dma_sync_* optimisations allowed?

We are running a imx6q system using the FEC network interface driver.
Our goal is to process many small multicast raw frames with ~128 bytes
each at a rate of 4kHz. Therefore the driver must be able to cope with
receiving small packet bursts.

Tests with 6*128 byte brusts @ 4kHz interval (~24MBit/s) caused frame
drops from time to time as seen in IEEE_rx_macerr stat increasing.
Therefore the small FEC hardware FIFO overruns.
When adding additional system load via stress -m 4 --vm-bytes 50M the
situation gets worse. This additional load slows down the whole
system, also the NAPI processing of the FEC driver.

Because of this behaviour I created a flamegraph to get an idea of
time spent in the processing pipeline.
Attached is a flame graph with no additional system stress and one
without. I could observe that most time is spent in dma_sync_*() calls
during rx descriptor processing.

Commit https://github.com/torvalds/linux/commit/1b7bde6d659d30f171259cc2dfba8e5dab34e735
introduced copybreak to increase performance. In our test copybreak is
active the whole time because the 128 bytes packet size is below the
256 default limit.
As I can see from the commit the goal is to allocate a new skb with
the minimum size needed, memcpy the contents to it and pass the new
skb up the network stack. This allows it to keep the larger rx
descriptor buffer (~2KB) DMA mapped without allocation of a new
FEC_ENET_RX_FRSIZE (2KB) buffer each time.

fec_enet_copybreak(u32 length, ...) uses

dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
if (!swap)
   memcpy(new_skb->data, (*skb)->data, length);

to sync the descriptor data buffer and memcpy the data to the new skb
without calling dma_unmap().
Later in fec_enet_rx_queue() the dma descriptor buffer is synced again
in the opposite direction.

if (is_copybreak) {
  dma_sync_single_for_device(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),  FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
}

Now the two main questions:
1. Is it necessary to call dma_sync_single_for_cpu for the whole
buffer size (FEC_ENET_RX_FRSIZE - fep->rx_align), wouldn't syncing the
real packet length which is accessed by memcpy be enough?
Like so: dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), (u32) length, DMA_FROM_DEVICE);

2. Is dma_sync_single_for_device even necessary? There is no data
passed back to the device because the skb descriptor buffer is not
modified and the fec peripheral does not need any valid data.
The example in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
states:
 /* CPU should not write to
  * DMA_FROM_DEVICE-mapped area,
  * so dma_sync_single_for_device() is
  * not needed here. It would be required
  * for DMA_BIDIRECTIONAL mapping if
  * the memory was modified.
 */
I am new to the DMA API on ARM. Are these changes regarding cache
flushing,... allowed? These would increase the copybreak throughput by
reducing CPU load.

Download attachment "copybreak_memorystress.png" of type "image/png" (54512 bytes)

Download attachment "copybreak_nostress.png" of type "image/png" (55569 bytes)

Powered by blists - more mailing lists