netdev - Re: Optimizing kernel compilation / alignments for network performance


Date:   Thu, 5 May 2022 17:42:56 +0200
From:   Rafał Miłecki <zajec5@...il.com>
To:     Arnd Bergmann <arnd@...db.de>
Cc:     Alexander Lobakin <alexandr.lobakin@...el.com>,
        Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Russell King <linux@...linux.org.uk>,
        Andrew Lunn <andrew@...n.ch>, Felix Fietkau <nbd@....name>,
        "openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
        Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network
 performance

On 29.04.2022 16:49, Arnd Bergmann wrote:
> On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@...il.com> wrote:
>> On 27.04.2022 14:56, Alexander Lobakin wrote:
> 
>> Thank you Alexander, this appears to be helpful! I decided to ignore
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
>> manually.
>>
>>
>> 1. Without ce5013ff3bec and with -falign-functions=32
>> 387 Mb/s
>>
>> 2. Without ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>> 3. With ce5013ff3bec and with -falign-functions=32
>> 384 Mb/s
>>
>> 4. With ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>>
>> So it seems that:
>> 1. -falign-functions=32 = pretty stable high speed
>> 2. -falign-functions=64 = very stable slightly lower speed
>>
>>
>> I'm going to perform tests on more commits but if it stays so reliable
>> as above that will be a huge success for me.
> 
> Note that the problem may not just be the alignment of a particular
> function, but also how different functions map into your cache.
> The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
> 64KB, with a line size of 32 bytes. If you are unlucky and you get
> five different functions that are frequently called and are spaced
> exactly wrong, so that they need more than four ways, calling them
> in sequence would always evict the other ones. The same could of
> course happen if the problem is the D-cache or the L2.
> 
> Can you try to get a profile using 'perf record' to see where most
> time is spent, in both the slowest and the fastest versions?
> If the instruction cache is the issue, you should see how the hottest
> addresses line up.

Your explanation sounds sane of course.
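
To make the set-conflict scenario concrete, here is a rough sketch of how
one could check whether a group of hot addresses competes for the same
cache set. It assumes the 32-byte line size and 4 ways mentioned above and
a 32KB L1 by default; the addresses are made up for illustration and the
helper names (set_index, oversubscribed_sets) are hypothetical. It also
ignores virtual vs. physical indexing, so treat it as an approximation.

```python
from collections import defaultdict

LINE_SIZE = 32  # bytes per cache line (from the description above)
WAYS = 4        # 4-way set-associative


def set_index(addr, cache_size=32 * 1024):
    """Return the cache set that the line containing addr maps to."""
    num_sets = cache_size // (LINE_SIZE * WAYS)
    return (addr // LINE_SIZE) % num_sets


def oversubscribed_sets(addrs, cache_size=32 * 1024):
    """Return {set: [line numbers]} for sets needing more than WAYS lines."""
    by_set = defaultdict(set)
    for addr in addrs:
        by_set[set_index(addr, cache_size)].add(addr // LINE_SIZE)
    return {s: sorted(lines) for s, lines in by_set.items()
            if len(lines) > WAYS}


if __name__ == "__main__":
    # Hypothetical hot addresses spaced exactly one way size (8KB) apart,
    # so all five land in the same set of a 32KB 4-way cache.
    hot = [0xC0100000 + i * 0x2000 for i in range(5)]
    for s, lines in oversubscribed_sets(hot).items():
        print(f"set {s}: {len(lines)} hot lines compete for {WAYS} ways")
```

Feeding it the hottest addresses from a profile (instead of the made-up
ones) would show whether any set is asked to hold more than four hot lines.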

If you take a look at my old e-mail
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that the most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for better cache usage by the
selected functions (above)?


Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks
reported as worth trying. It's another source of randomness: it stabilizes
NAT performance across some commits and breaks stability across others.