netdev - Re: Optimizing kernel compilation / alignments for network performance


Message-ID: <510bd08b-3d46-2fc8-3974-9d99fd53430e@gmail.com>
Date:   Fri, 6 May 2022 09:44:35 +0200
From:   Rafał Miłecki <zajec5@...il.com>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     Arnd Bergmann <arnd@...db.de>,
        Alexander Lobakin <alexandr.lobakin@...el.com>,
        Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Russell King <linux@...linux.org.uk>,
        Felix Fietkau <nbd@....name>,
        "openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
        Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network
 performance

On 5.05.2022 18:04, Andrew Lunn wrote:
>> you'll see that most used functions are:
>> v7_dma_inv_range
>> __irqentry_text_end
>> l2c210_inv_range
>> v7_dma_clean_range
>> bcma_host_soc_read32
>> __netif_receive_skb_core
>> arch_cpu_idle
>> l2c210_clean_range
>> fib_table_lookup
> 
> There is a lot of cache management functions here. Might sound odd,
> but have you tried disabling SMP? These cache functions need to
> operate across all CPUs, and the communication between CPUs can slow
> them down. If there is only one CPU, these cache functions get simpler
> and faster.
> 
> It just depends on your workload. If you have 1 CPU loaded to 100% and
> the other 3 idle, you might see an improvement. If you actually need
> more than one CPU, it will probably be worse.

Disabling SMP seems to lower my NAT speed from ~362 Mbps to ~320 Mbps,
but throughput looks more stable now (less variation). Let me spend some
time on more testing.


FWIW, during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
That is what I need to get similar speeds across iperf sessions.

With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speed jumped between four values:
273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
(each time I started iperf, the kernel settled into one state and kept
  the same iperf speed until I stopped it and started another session)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speed jumped between two values:
284 Mbps / 408 Mbps
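
For readers unfamiliar with rps_cpus: the value written is a hexadecimal
CPU bitmask, where bit N selects CPU N and a mask of 0 disables RPS for
that queue. So the three settings above mean "RPS off", "CPU 0" and
"CPU 1". A minimal sketch of decoding such a mask follows; the helper
name mask_to_cpus is made up for illustration and is not an existing tool.

```shell
#!/bin/sh
# Decode an rps_cpus-style hex bitmask into the CPU numbers it selects.
# mask_to_cpus is a hypothetical helper, not part of any existing tool.
mask_to_cpus() {
    mask=$((0x$1))      # interpret the argument as hexadecimal
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$cpu"   # append CPU number, space separated
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "${out:-none}"  # "none" means RPS is disabled for the queue
}

mask_to_cpus 0   # RPS disabled
mask_to_cpus 1   # CPU 0 only
mask_to_cpus 2   # CPU 1 only
```

On a live system the current mask could be read with
cat /sys/class/net/eth0/queues/rx-0/rps_cpus and passed to the helper.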


> I've also found that some Ethernet drivers invalidate or flush too
> much. If you are sending a 64 byte TCP ACK, all you need to flush is
> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
> recycle the buffer, all you need to invalidate is the size of the ACK,
> so long as you can guarantee nothing has touched the memory above it.
> But you need to be careful when implementing tricks like this, or you
> can get subtle corruption bugs when you get it wrong.

That was actually bgmac's initial behaviour, see commit 92b9ccd34a90
("bgmac: pass received packet to the netif instead of copying it"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92b9ccd34a9053c628d230fe27a7e0c10179910f

I think it was Felix who suggested that I avoid skb_copy*(), and it does
seem to have improved performance.