Message-ID: <CAK8P3a2tA8vkB-G-sQdvoiB8Pj08LRn_Vhf7qT-YdBJQwaGhaA@mail.gmail.com>
Date: Fri, 29 Apr 2022 16:49:22 +0200
From: Arnd Bergmann <arnd@...db.de>
To: Rafał Miłecki <zajec5@...il.com>
Cc: Alexander Lobakin <alexandr.lobakin@...el.com>,
Network Development <netdev@...r.kernel.org>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
Russell King <linux@...linux.org.uk>,
Andrew Lunn <andrew@...n.ch>, Felix Fietkau <nbd@....name>,
"openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network performance
On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@...il.com> wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:
> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
>
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
>
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
>
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.
Note that the problem may not just be the alignment of a particular
function, but also how different functions map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and you get
five different functions that are frequently called and are spaced
at exactly the wrong distance, so that they need more than four ways
of the same set, calling them in sequence would always evict the
other ones. The same could of course happen if the problem is the
D-cache or the L2.
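
To make the aliasing argument concrete, here is a minimal sketch of how an address maps to a cache set, assuming the 32KB, 4-way, 32-byte-line configuration. The helper name and the five addresses are made up for illustration and are not taken from any real kernel image.

```python
def cache_set_index(addr, cache_size=32 * 1024, ways=4, line_size=32):
    """Return the set an address maps to in a set-associative cache."""
    # 32KB / (4 ways * 32 bytes) = 256 sets for the assumed configuration
    num_sets = cache_size // (ways * line_size)
    return (addr // line_size) % num_sets


# One way covers cache_size / ways bytes (8KB here); addresses that are a
# multiple of that distance apart always share a set.
way_size = 32 * 1024 // 4

# Hypothetical start addresses of five hot functions, 8KB apart.
hot = [0xC0100000 + i * way_size for i in range(5)]
sets = [cache_set_index(a) for a in hot]
print(sets)
```

All five addresses land in the same set, so a 4-way cache cannot hold their first lines at the same time. Shifting any one of them by a single 32-byte line would move it to a different set, which is why a change in function alignment can make the effect appear or disappear.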
Can you try to get a profile using 'perf record' to see where most
of the time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest
addresses line up.
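
One way to check how the hottest addresses line up is to group them by cache set and look for sets that hold more hot lines than the cache has ways. The sketch below assumes the same 32KB, 4-way, 32-byte-line configuration and assumes you have already copied a list of hot instruction addresses out of the perf output by hand. The function name and the sample addresses are hypothetical, not real measurements.

```python
from collections import defaultdict


def oversubscribed_sets(hot_addrs, cache_size=32 * 1024, ways=4, line_size=32):
    """Return {set index: hot line addresses} for sets with more hot lines than ways."""
    num_sets = cache_size // (ways * line_size)
    lines_per_set = defaultdict(set)
    for addr in hot_addrs:
        line = addr // line_size  # addresses within one line count once
        lines_per_set[line % num_sets].add(line)
    return {
        s: sorted(line * line_size for line in lines)
        for s, lines in lines_per_set.items()
        if len(lines) > ways
    }


# Made-up sample addresses: five lines 8KB apart, plus two hits in one other line.
samples = [0xC0100040 + i * 0x2000 for i in range(5)] + [0xC0200100, 0xC0200104]
conflicts = oversubscribed_sets(samples)
for s, addrs in conflicts.items():
    print(s, [hex(a) for a in addrs])
```

If the slow build shows such an oversubscribed set among its hottest addresses and the fast build does not, that would point at instruction cache conflicts rather than at the alignment of any single function.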
Arnd