netdev - Re: Optimizing kernel compilation / alignments for network performance


Date:   Fri, 6 May 2022 09:47:41 +0200
From:   Rafał Miłecki <zajec5@...il.com>
To:     Felix Fietkau <nbd@....name>, Andrew Lunn <andrew@...n.ch>
Cc:     Arnd Bergmann <arnd@...db.de>,
        Alexander Lobakin <alexandr.lobakin@...el.com>,
        Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Russell King <linux@...linux.org.uk>,
        "openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
        Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network
 performance

On 5.05.2022 18:46, Felix Fietkau wrote:
> 
> On 05.05.22 18:04, Andrew Lunn wrote:
>>> you'll see that most used functions are:
>>> v7_dma_inv_range
>>> __irqentry_text_end
>>> l2c210_inv_range
>>> v7_dma_clean_range
>>> bcma_host_soc_read32
>>> __netif_receive_skb_core
>>> arch_cpu_idle
>>> l2c210_clean_range
>>> fib_table_lookup
>>
>> There are a lot of cache management functions here. Might sound odd,
>> but have you tried disabling SMP? These cache functions need to
>> operate across all CPUs, and the communication between CPUs can slow
>> them down. If there is only one CPU, these cache functions get simpler
>> and faster.
>>
>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>> the other 3 idle, you might see an improvement. If you actually need
>> more than one CPU, it will probably be worse.
>>
>> I've also found that some Ethernet drivers invalidate or flush too
>> much. If you are sending a 64 byte TCP ACK, all you need to flush is
>> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
>> recycle the buffer, all you need to invalidate is the size of the ACK,
>> so long as you can guarantee nothing has touched the memory above it.
>> But you need to be careful when implementing tricks like this, or you
>> can get subtle corruption bugs when you get it wrong.
> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> This seems rather excessive, especially since most people are going to use an MTU of 1500.
> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> This should significantly reduce the time spent on flushing caches.

Oh, that's important too. The RX frame size was changed by commit
8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond
8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03

It lowered NAT speed with bgmac by about 60% (362 Mbps → 140 Mbps).

I do all my testing with
#define BGMAC_RX_MAX_FRAME_SIZE			1536
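
For illustration only, here is a rough userspace sketch of the MTU-dependent
sizing proposed above, plus a count of the cache lines each buffer spans. This
is not bgmac code: the function names, the overhead constants and the 32-byte
cache line are my assumptions, and it ignores any RX header the hardware may
prepend. Under those assumptions an MTU of 1500 happens to come out at 1536.

```c
/* Assumed framing overheads in bytes: Ethernet header, one VLAN tag, FCS. */
#define SKETCH_ETH_HLEN     14
#define SKETCH_VLAN_HLEN     4
#define SKETCH_FCS_LEN       4

/* Assumed cache line size in bytes; check the actual CPU before relying on it. */
#define SKETCH_CACHE_LINE   32

/* Hypothetical helper: number of cache lines covered by a buffer of len bytes. */
unsigned int cache_lines_for(unsigned int len)
{
	return (len + SKETCH_CACHE_LINE - 1) / SKETCH_CACHE_LINE;
}

/*
 * Hypothetical helper: RX frame size needed for a given MTU, rounded up
 * to a whole number of cache lines.
 */
unsigned int rx_frame_size_for_mtu(unsigned int mtu)
{
	unsigned int len = mtu + SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN + SKETCH_FCS_LEN;

	return cache_lines_for(len) * SKETCH_CACHE_LINE;
}
```

With these numbers a 9724-byte buffer spans 304 cache lines while a 1536-byte
one spans 48, so if the whole mapped buffer is invalidated per packet, that is
roughly six times more cache maintenance work for the same 1500-byte traffic.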