netdev - Re: [PATCH] arm64: do_csum: implement accelerated scalar version


Message-ID: <CAKv+Gu-HJ1fRoetsYMLKkpGa4QCRfCJ2WAhcX=gUfonR4F-bEQ@mail.gmail.com>
Date:   Thu, 28 Feb 2019 16:28:30 +0100
From:   Ard Biesheuvel <ard.biesheuvel@...aro.org>
To:     Robin Murphy <robin.murphy@....com>
Cc:     Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Catalin Marinas <catalin.marinas@....com>,
        "<netdev@...r.kernel.org>" <netdev@...r.kernel.org>,
        "huanglingyan (A)" <huanglingyan2@...wei.com>,
        Will Deacon <will.deacon@....com>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Steve Capper <steve.capper@....com>
Subject: Re: [PATCH] arm64: do_csum: implement accelerated scalar version

On Thu, 28 Feb 2019 at 16:14, Robin Murphy <robin.murphy@....com> wrote:
>
> Hi Ard,
>
> On 28/02/2019 14:16, Ard Biesheuvel wrote:
> > (+ Catalin)
> >
> > On Tue, 19 Feb 2019 at 16:08, Ilias Apalodimas
> > <ilias.apalodimas@...aro.org> wrote:
> >>
> >> On Tue, Feb 19, 2019 at 12:08:42AM +0100, Ard Biesheuvel wrote:
> >>> It turns out that the IP checksumming code is still exercised often,
> >>> even though one might expect that modern NICs with checksum offload
> >>> have no use for it. However, as Lingyan points out, there are
> >>> combinations of features where the network stack may still fall back
> >>> to software checksumming, and so it makes sense to provide an
> >>> optimized implementation in software as well.
> >>>
> >>> So provide an implementation of do_csum() in scalar assembler, which,
> >>> unlike C, gives direct access to the carry flag, making the code run
> >>> substantially faster. The routine uses overlapping 64 byte loads for
> >>> all input size > 64 bytes, in order to reduce the number of branches
> >>> and improve performance on cores with deep pipelines.
> >>>
> >>> On Cortex-A57, this implementation is on par with Lingyan's NEON
> >>> implementation, and roughly 7x as fast as the generic C code.
> >>>
> >>> Cc: "huanglingyan (A)" <huanglingyan2@...wei.com>
> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@...aro.org>
> > ...
> >>
> >> Acked-by: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> >
> > Full patch here
> >
> > https://lore.kernel.org/linux-arm-kernel/20190218230842.11448-1-ard.biesheuvel@linaro.org/
> >
> > This was a follow-up to some discussions about Lingyan's NEON code,
> > CC'ed to netdev@ so people could chime in as to whether we need
> > accelerated checksumming code in the first place.

Thanks for taking a look.

> FWIW ever since we did ip_fast_csum() I've been meaning to see how well
> I can do with a similar tweaked C implementation for this (mostly for
> fun). Since I've recently dug out my RK3328 box for other reasons I'll
> give this a test - that's a weedy little quad-A53 whose GbE hardware
> checksumming is slightly busted and has to be turned off, so the
> do_csum() overhead under heavy network load is comparatively massive.
> (plus it's non-EFI so I should be able to try big-endian easily too)
>

Yes please. I've been meaning to run this on A72 myself, but ever
since my MacchiatoBin self-combusted, I've been relying on AWS for
this, which is a bit finicky.

As for the C implementation, not having access to the carry flag is
pretty limiting, so I wonder how you intend to get around that.
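
For reference, the usual way to recover the carry in plain C is to
compare the wrapped sum against one of the addends. The following is a
minimal sketch of that idiom, not code from the patch: the names
add64_carry, fold64, csum_ref and csum64 are hypothetical, and the
functions return the un-inverted ones' complement sum in native byte
order. Whether a compiler turns the comparison back into an
add-with-carry sequence is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add x into sum with end-around carry: if the 64-bit addition wraps,
 * the wrapped result is smaller than x, so the comparison yields the
 * carry bit that assembler would read from the flags. */
static uint64_t add64_carry(uint64_t sum, uint64_t x)
{
	sum += x;
	return sum + (sum < x);
}

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: add one native-order 16-bit word at a time. */
static uint16_t csum_ref(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;

	while (len >= 2) {
		memcpy(&w, p, 2);
		sum += w;
		sum = (sum & 0xffffu) + (sum >> 16);
		p += 2;
		len -= 2;
	}
	if (len) {
		/* A trailing odd byte is padded with a zero byte. */
		unsigned char last[2] = { *p, 0 };

		memcpy(&w, last, 2);
		sum += w;
		sum = (sum & 0xffffu) + (sum >> 16);
	}
	return (uint16_t)sum;
}

/* Same sum, accumulated 8 bytes at a time with the carry idiom. */
static uint16_t csum64(const unsigned char *p, size_t len)
{
	uint64_t sum = 0, w;

	while (len >= 8) {
		memcpy(&w, p, 8);
		sum = add64_carry(sum, w);
		p += 8;
		len -= 8;
	}
	if (len) {
		/* Zero-pad the tail so the 16-bit lanes stay in place. */
		w = 0;
		memcpy(&w, p, len);
		sum = add64_carry(sum, w);
	}
	return fold64(sum);
}
```

Each add costs an extra compare and add here, which is the overhead
that direct access to the carry flag avoids.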

> The asm looks pretty reasonable to me - instinct says there's *possibly*
> some value for out-of-order cores in doing the 8-way accumulations in a
> more pairwise fashion, but I guess either way the carry flag dependency
> is going to dominate, so it may well be moot.

Yes. In fact, I was surprised the speedup is as dramatic as it is
despite this, but I guess the cores optimize for it rather well at the
uarch level.

> What may be more
> worthwhile is taking the effort to align the source pointer, at least
> for larger inputs, so as to be kinder to little cores - according to its
> optimisation guide, A55 is fairly sensitive to unaligned loads, so I'd
> assume that's true of its older/smaller friends too. I'll see what I can
> measure in practice - until proven otherwise I'd have no great objection
> to merging this patch as-is if the need is real. Improvements can always
> come later :)
>

Good point re alignment, I didn't consider that at all tbh.
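
To make the alignment idea concrete, here is a rough C sketch of one
way to do it, under my own assumptions rather than anything in the
patch: the leading bytes are placed into a zero-padded word as if the
buffer started at the previous 8-byte boundary, every later load is
then 8-byte aligned, and the result is byte-swapped at the end if the
buffer started at an odd address. The names add64_carry, fold64,
csum_ref and csum_aligned are hypothetical, and the sum is returned
un-inverted in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry, without access to the carry flag. */
static uint64_t add64_carry(uint64_t sum, uint64_t x)
{
	sum += x;
	return sum + (sum < x);
}

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: add one native-order 16-bit word at a time. */
static uint16_t csum_ref(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;

	while (len >= 2) {
		memcpy(&w, p, 2);
		sum += w;
		sum = (sum & 0xffffu) + (sum >> 16);
		p += 2;
		len -= 2;
	}
	if (len) {
		unsigned char last[2] = { *p, 0 };

		memcpy(&w, last, 2);
		sum += w;
		sum = (sum & 0xffffu) + (sum >> 16);
	}
	return (uint16_t)sum;
}

static uint16_t csum_aligned(const unsigned char *p, size_t len)
{
	size_t off = (size_t)((uintptr_t)p & 7); /* distance past boundary */
	uint64_t sum = 0, w;
	uint16_t res;

	if (off) {
		/* Head: rebuild the aligned word with its first 'off'
		 * bytes zeroed, so no byte before p is ever read. */
		unsigned char tmp[8] = { 0 };
		size_t head = 8 - off;

		if (head > len)
			head = len;
		memcpy(tmp + off, p, head);
		memcpy(&w, tmp, 8);
		sum = add64_carry(sum, w);
		p += head;
		len -= head;
	}
	while (len >= 8) {
		memcpy(&w, p, 8);	/* p is 8-byte aligned here */
		sum = add64_carry(sum, w);
		p += 8;
		len -= 8;
	}
	if (len) {
		w = 0;
		memcpy(&w, p, len);	/* zero-padded tail */
		sum = add64_carry(sum, w);
	}
	res = fold64(sum);
	/* An odd start put every byte in the other half of its 16-bit
	 * lane; swapping the two bytes of the folded sum undoes that. */
	if (off & 1)
		res = (uint16_t)((res << 8) | (res >> 8));
	return res;
}
```

The final swap relies on the ones' complement sum being insensitive to
byte order up to a byte swap of the result, which is what allows the
odd-address case to be handled once at the end instead of per load.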

I'll let the maintainers decide whether/when to merge this. I don't
feel strongly either way.