linux-kernel - Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAFUsyfK-znRWJN7FTMdJaDTd45DgtBQ9ckKGyh8qYqn0eFMMFA@mail.gmail.com>
Date:   Thu, 25 Nov 2021 20:15:04 -0600
From:   Noah Goldstein <goldstein.w.n@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     tglx@...utronix.de, mingo@...hat.com,
        Borislav Petkov <bp@...en8.de>, dave.hansen@...ux.intel.com,
        X86 ML <x86@...nel.org>, hpa@...or.com, peterz@...radead.org,
        alexanderduyck@...com, open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@...il.com> wrote:
> >
> > Modify the 8x loop to that it uses two independent
> > accumulators. Despite adding more instructions the latency and
> > throughput of the loop is improved because the `adc` chains can now
> > take advantage of multiple execution units.
>
> Nice !
>
> Note that I get better results if I do a different split, because the
> second chain gets shorter.
>
> First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> load, so that is really 4+1 additions.
>
> Second chain adds 3*8 bytes from the buffer, plus the result coming
> from the first chain, also 4+1 additions.

Good call. With that approach I also see an improvement for the 32 byte
case (which is on the hot path) so this change might actually matter :)

>
> asm("movq 0*8(%[src]),%[res_tmp]\n\t"
>     "addq 1*8(%[src]),%[res_tmp]\n\t"
>     "adcq 2*8(%[src]),%[res_tmp]\n\t"
>     "adcq 3*8(%[src]),%[res_tmp]\n\t"
>     "adcq 4*8(%[src]),%[res_tmp]\n\t"
>     "adcq $0,%[res_tmp]\n\t"
>     "addq 5*8(%[src]),%[res]\n\t"
>     "adcq 6*8(%[src]),%[res]\n\t"
>     "adcq 7*8(%[src]),%[res]\n\t"
>     "adcq %[res_tmp],%[res]\n\t"
>     "adcq $0,%[res]"
>     : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
>     : [src] "r" (buff)
>     : "memory");
>
>
> >
> > Make the memory clobbers more precise. 'buff' is read only and we know
> > the exact usage range. There is no reason to write-clobber all memory.
>
> Not sure if that matters in this function ? Or do we expect it being inlined ?

It may matter for LTO build. I also think it can matter for the loop
case. I didn't see
any difference when playing around with the function in userland with:

```
gcc -O3 -march=native -mtune=native checksum.c -o checksum
```

but IIRC if the clobber is loops with inline assembly payloads can be
de-optimized if GCC can't prove the iterations don't affect each other.


>
> Personally, I find the "memory" constraint to be more readable than these casts
> "m"(*(const char(*)[64])buff));
>

Hmm, I personally find it more readable if I can tell what memory
transforms happen
just from reading the clobbers, but you're the maintainer.

Do you want it changed in V2?

> >
> > Relative performance changes on Tigerlake:
> >
> > Time Unit: Ref Cycles
> > Size Unit: Bytes
> >
> > size,   lat old,    lat new,    tput old,   tput new
> >    0,     4.972,      5.054,       4.864,      4.870
>
> Really what matters in modern networking is the case for 40 bytes, and
> eventually 8 bytes.
>
> Can you add these two cases in this nice table ?
>

Sure, with your suggestion in the 32 byte cases there is an improvement there
too.

> We hardly have to checksum anything with NIC that are not decades old.
>
> Apparently making the 64byte loop slightly longer incentives  gcc to
> move it away (our intent with the unlikely() hint).
>
> Anyway I am thinking of providing a specialized inline version for
> IPv6 header checksums (40 + x*8 bytes, x being 0  pretty much all the
> time),
> so we will likely not use csum_partial() anymore.

I see. For now is it worth adding a case for 40 in this implementation?

    if(likely(len == 40)) {
        // optimized 40 + buff aligned case
    }
    else {
        // existing code
    }


>
> Thanks !