Message-ID: <CAD=FV=V1by2=VpbRKzcuRaW6+jj8yL4MwLZf06Nn6oq=GXrOOQ@mail.gmail.com>
Date: Tue, 22 Sep 2020 17:39:24 -0700
From: Doug Anderson <dianders@...omium.org>
To: Ard Biesheuvel <ardb@...nel.org>
Cc: Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Jackie Liu <liuyun01@...inos.cn>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
Ard Biesheuvel <ard.biesheuvel@...aro.org>
Subject: Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest

On Mon, Sep 21, 2020 at 11:25 PM Ard Biesheuvel <ardb@...nel.org> wrote:
>
> On Tue, 22 Sep 2020 at 02:27, Douglas Anderson <dianders@...omium.org> wrote:
> >
> > On every boot time we see messages like this:
> >
> > [ 0.025360] calling calibrate_xor_blocks+0x0/0x134 @ 1
> > [ 0.025363] xor: measuring software checksum speed
> > [ 0.035351] 8regs : 3952.000 MB/sec
> > [ 0.045384] 32regs : 4860.000 MB/sec
> > [ 0.055418] arm64_neon: 5900.000 MB/sec
> > [ 0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > [ 0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> >
> > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > again, the arm64_neon implementation is the fastest way to do XOR.
> > ...and the above is on a system with HZ=1000. Due to the way the
> > testing happens, if we have HZ defined to something slower it'll take
> > much longer. HZ=100 means we spend 300 ms on every boot re-confirming
> > a fact that will be the same for every bootup.
> >
> > Trying to super-optimize the xor operation makes a lot of sense if
> > you're using software RAID, but the above is probably not worth it for
> > most Linux users because:
> > 1. Quite a few arm64 kernels are built for embedded systems where
> > software raid isn't common. That means we're spending lots of time
> > on every boot trying to optimize something we don't use.
> > 2. Presumably, if we have neon, it's faster than alternatives. If
> > it's not, it's not expected to be tons slower.
> > 3. Quite a lot of arm64 systems are big.LITTLE. This means that the
> > existing test is somewhat misguided because it's assuming that test
> > results on the boot CPU apply to the other CPUs in the system.
> > This is not necessarily the case.
> >
> > Let's add a new config option that allows us to just use the neon
> > functions (if present) without benchmarking.
> >
> > NOTE: One small side effect is that on an arm64 system _without_ neon
> > we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
> > versions of the function. That's presumably OK since we already test
> > all those when KERNEL_MODE_NEON is disabled.
> >
> > ALSO NOTE: presumably the way to do better than this is to add some
> > sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
> > XOR function each time xor is called. Without seeing evidence that
> > this would really help someone, though, that doesn't seem worth it.
> >
> > Signed-off-by: Douglas Anderson <dianders@...omium.org>
>
> On the two arm64 machines that I happen to have running right now, I get
>
> SynQuacer (Cortex-A53)
>
> 8regs : 1917.000 MB/sec
> 32regs : 2270.000 MB/sec
> arm64_neon: 2053.000 MB/sec
>
> ThunderX2
>
> 8regs : 10170.000 MB/sec
> 32regs : 12051.000 MB/sec
> arm64_neon: 10948.000 MB/sec
>
> so your assertion is not entirely valid.

OK, good to know.

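
For anyone following along, here is a minimal userspace sketch of what
those numbers mean for the proposed option. It is not the kernel code:
the struct, the function names and the assume_neon flag are all made up
for illustration, and only the MB/sec figures come from the logs quoted
in this thread.

```c
#include <stddef.h>
#include <string.h>

/* One line of the boot-time benchmark output: template name and speed. */
struct xor_result {
    const char *name;
    unsigned int mb_per_sec;
};

/* Figures copied from the logs quoted in this thread. */
const struct xor_result original_report[] = {
    { "8regs", 3952 }, { "32regs", 4860 }, { "arm64_neon", 5900 },
};
const struct xor_result synquacer[] = {
    { "8regs", 1917 }, { "32regs", 2270 }, { "arm64_neon", 2053 },
};
const struct xor_result thunderx2[] = {
    { "8regs", 10170 }, { "32regs", 12051 }, { "arm64_neon", 10948 },
};

/* Index of the entry with the highest measured speed (n must be > 0). */
size_t pick_fastest(const struct xor_result *r, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (r[i].mb_per_sec > r[best].mb_per_sec)
            best = i;
    return best;
}

/*
 * Hypothetical model of the proposed option: when assume_neon is set and
 * an "arm64_neon" entry exists, return it without comparing speeds.
 * Otherwise fall back to the measured winner.
 */
size_t pick_template(const struct xor_result *r, size_t n, int assume_neon)
{
    if (assume_neon)
        for (size_t i = 0; i < n; i++)
            if (strcmp(r[i].name, "arm64_neon") == 0)
                return i;
    return pick_fastest(r, n);
}
```

With the figures above, pick_template(synquacer, 3, 0) selects 32regs
while pick_template(synquacer, 3, 1) selects arm64_neon, which is
roughly 9-10% slower on both of the machines Ard measured. On the system
from the original report the two calls agree.
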
> If the system does not need XOR, it is free not to load the module, so
> there is no reason it has to affect the boot time.

The fact that it was run super early somehow made me just assume that
this couldn't be a module, but of course you're right that it can be a
module. That works for me and saves me my precious boot time. ;-)

That being said, this'll still bite anyone who wants to build this in
for whatever reason. I'll respond to your other email with more...
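
To put rough numbers on the built-in case, here is a small sketch of the
arithmetic from the commit message. It assumes each template is measured
for about 10 jiffies, which I'm inferring from the ~10 ms spacing of the
log timestamps at HZ=1000 and not from the kernel source. The names
calibration_ms and JIFFIES_PER_TEMPLATE are made up for illustration.

```c
/*
 * Assumption: each XOR template is benchmarked for about 10 jiffies.
 * This is inferred from the ~10 ms gaps between the log lines at
 * HZ=1000, not read out of the kernel source.
 */
#define JIFFIES_PER_TEMPLATE 10u

/*
 * Estimated wall-clock time, in milliseconds, spent benchmarking
 * `templates` XOR implementations on a kernel built with the given HZ.
 * Returns 0 for hz == 0 to avoid dividing by zero.
 */
unsigned int calibration_ms(unsigned int hz, unsigned int templates)
{
    if (hz == 0)
        return 0;
    return templates * JIFFIES_PER_TEMPLATE * 1000u / hz;
}
```

Under that assumption, calibration_ms(1000, 3) gives 30 and
calibration_ms(100, 3) gives 300, matching the 30 ms and 300 ms figures
in the commit message.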