lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 29 Apr 2022 16:49:22 +0200
From:   Arnd Bergmann <arnd@...db.de>
To:     Rafał Miłecki <zajec5@...il.com>
Cc:     Alexander Lobakin <alexandr.lobakin@...el.com>,
        Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Russell King <linux@...linux.org.uk>,
        Andrew Lunn <andrew@...n.ch>, Felix Fietkau <nbd@....name>,
        "openwrt-devel@...ts.openwrt.org" <openwrt-devel@...ts.openwrt.org>,
        Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Optimizing kernel compilation / alignments for network performance

On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@...il.com> wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:

> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
>
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
>
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
>
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular
function, but also how different function map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and you get
five different functions that are frequently called and are a multiple
functions are exactly the wrong spacing that they need more than
four ways, calling them in sequence would always evict the other
ones. The same could of course happen if the problem is the D-cache
or the L2.

Can you try to get a profile using 'perf record' to see where most
time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest
addresses line up.

        Arnd

Powered by blists - more mailing lists