netdev - Re: Optimizing kernel compilation / alignments for network performance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <391ca2d1-6977-0c9b-588c-31ad9bb68c82@gmail.com>
Date:   Tue, 10 May 2022 12:29:32 +0200
From:   Rafał Miłecki <zajec5@...il.com>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     Felix Fietkau <nbd@....name>, Arnd Bergmann <arnd@...db.de>,
        Alexander Lobakin <alexandr.lobakin@...el.com>,
        Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Russell King <linux@...linux.org.uk>,
        "openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
        Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network
 performance

On 6.05.2022 14:42, Andrew Lunn wrote:
>>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
>>> This seems rather excessive, especially since most people are going to use a MTU of 1500.
>>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
>>> This should significantly reduce the time spent on flushing caches.
>>
>> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
>> configure MTU and add support for frames beyond 8192 byte size"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>>
>> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>>
>> I do all my testing with
>> #define BGMAC_RX_MAX_FRAME_SIZE			1536
> 
> That helps show that cache operations are part of your bottleneck.
> 
> Taking a quick look at the driver. On the receive side:
> 
>                         /* Unmap buffer to make it accessible to the CPU */
>                          dma_unmap_single(dma_dev, dma_addr,
>                                           BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
> 
> Here is data is mapped read for the CPU to use it.
> 
> 			/* Get info from the header */
>                          len = le16_to_cpu(rx->len);
>                          flags = le16_to_cpu(rx->flags);
> 
>                          /* Check for poison and drop or pass the packet */
>                          if (len == 0xdead && flags == 0xbeef) {
>                                  netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
>                                             ring->start);
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
> 
>                          if (len > BGMAC_RX_ALLOC_SIZE) {
>                                  netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
>                                             ring->start);
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_length_errors++;
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
> 
>                          /* Omit CRC. */
>                          len -= ETH_FCS_LEN;
> 
>                          skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
>                          if (unlikely(!skb)) {
>                                  netdev_err(bgmac->net_dev, "build_skb failed\n");
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
>                          skb_put(skb, BGMAC_RX_FRAME_OFFSET +
>                                  BGMAC_RX_BUF_OFFSET + len);
>                          skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
>                                   BGMAC_RX_BUF_OFFSET);
> 
>                          skb_checksum_none_assert(skb);
>                          skb->protocol = eth_type_trans(skb, bgmac->net_dev);
> 
> and this is the first access of the actual data. You can make the
> cache actually work for you, rather than against you, to adding a call to
> 
> 	prefetch(buf);
> 
> just after the dma_unmap_single(). That will start getting the frame
> header from DRAM into cache, so hopefully it is available by the time
> eth_type_trans() is called and you don't have a cache miss.


I don't think that analysis is correct.

Please take a look at following lines:
struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
void *buf = slot->buf;

The first we do after dma_unmap_single() call is rx->len read. That
actually points to DMA data. There is nothing we could keep CPU busy
with while preteching data.

FWIW I tried adding prefetch(buf); anyway. I didn't change NAT speed by
a single 1 Mb/s. Speed was exactly the same as without prefetch() call.