Message-ID: <CAFUsyfLeVGW8etXHuSGvYy_RoS3RGaA1L+NLKnpc7EsSMVORBg@mail.gmail.com>
Date: Fri, 26 Nov 2021 12:17:23 -0600
From: Noah Goldstein <goldstein.w.n@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: tglx@...utronix.de, mingo@...hat.com,
Borislav Petkov <bp@...en8.de>, dave.hansen@...ux.intel.com,
X86 ML <x86@...nel.org>, hpa@...or.com, peterz@...radead.org,
alexanderduyck@...com, open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
On Fri, Nov 26, 2021 at 10:09 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <goldstein.w.n@...il.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@...il.com> wrote:
> > > >
> > > > Modify the 8x loop so that it uses two independent
> > > > accumulators. Despite adding more instructions, the latency and
> > > > throughput of the loop are improved because the `adc` chains can now
> > > > take advantage of multiple execution units.
> > >
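For context, the v1 body splits the eight quadwords evenly across two `adc` chains, roughly like the sketch below (illustrative only; it may differ in small details from the actual patch, and temp_accum is just a scratch u64):

```
/* Sketch: two independent carry chains over 4 quadwords each, merged at
 * the end; the chains can retire on separate execution ports. */
asm("movq 0*8(%[src]),%[res_tmp]\n\t"
    "addq 1*8(%[src]),%[res_tmp]\n\t"
    "adcq 2*8(%[src]),%[res_tmp]\n\t"
    "adcq 3*8(%[src]),%[res_tmp]\n\t"
    "adcq $0,%[res_tmp]\n\t"
    "addq 4*8(%[src]),%[res]\n\t"
    "adcq 5*8(%[src]),%[res]\n\t"
    "adcq 6*8(%[src]),%[res]\n\t"
    "adcq 7*8(%[src]),%[res]\n\t"
    "adcq %[res_tmp],%[res]\n\t"
    "adcq $0,%[res]"
    : [res] "+r" (temp64), [res_tmp] "=&r" (temp_accum)
    : [src] "r" (buff)
    : "memory");
```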
> > > Nice !
> > >
> > > Note that I get better results if I do a different split, because the
> > > second chain gets shorter.
> > >
> > > First chain adds 5*8 bytes from the buffer, but the first 8 bytes are a
> > > mere load, so that is really 4+1 additions.
> > >
> > > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > > from the first chain, also 4+1 additions.
> >
> > Good call. With that approach I also see an improvement for the 32-byte
> > case (which is on the hot path), so this change might actually matter :)
> >
> > >
> > > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> > > "addq 1*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 2*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 3*8(%[src]),%[res_tmp]\n\t"
> > > "adcq 4*8(%[src]),%[res_tmp]\n\t"
> > > "adcq $0,%[res_tmp]\n\t"
> > > "addq 5*8(%[src]),%[res]\n\t"
> > > "adcq 6*8(%[src]),%[res]\n\t"
> > > "adcq 7*8(%[src]),%[res]\n\t"
> > > "adcq %[res_tmp],%[res]\n\t"
> > > "adcq $0,%[res]"
> > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> > > : [src] "r" (buff)
> > > : "memory");
> > >
> > >
> > > >
> > > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > > the exact usage range. There is no reason to write-clobber all memory.
> > >
> > > Not sure if that matters in this function? Or do we expect it to be inlined?
> >
> > It may matter for an LTO build. I also think it can matter for the loop
> > case. I didn't see any difference when playing around with the function
> > in userland with:
> >
> > ```
> > gcc -O3 -march=native -mtune=native checksum.c -o checksum
> > ```
> >
> > but IIRC, with a "memory" clobber, loops with inline assembly payloads can
> > be de-optimized if GCC can't prove the iterations don't affect each other.
> >
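For reference, the "precise" form being discussed drops the blanket "memory" clobber and instead passes the read range as a sized memory input operand, along the lines of this sketch (assuming the 64-byte unrolled block and the temp64/buff names used elsewhere in the function):

```
/* Sketch: the cast tells the compiler exactly which 64 bytes the asm
 * reads, so unrelated memory accesses need not be flushed or reloaded. */
asm("addq 0*8(%[src]),%[res]\n\t"
    "adcq 1*8(%[src]),%[res]\n\t"
    "adcq 2*8(%[src]),%[res]\n\t"
    "adcq 3*8(%[src]),%[res]\n\t"
    "adcq 4*8(%[src]),%[res]\n\t"
    "adcq 5*8(%[src]),%[res]\n\t"
    "adcq 6*8(%[src]),%[res]\n\t"
    "adcq 7*8(%[src]),%[res]\n\t"
    "adcq $0,%[res]"
    : [res] "+r" (temp64)
    : [src] "r" (buff),
      "m" (*(const char (*)[64])buff));
```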
> >
> > >
> > > Personally, I find the "memory" constraint to be more readable than these casts
> > > "m"(*(const char(*)[64])buff));
> > >
> >
> > Hmm, I personally find it more readable if I can tell what memory
> > transforms happen
> > just from reading the clobbers, but you're the maintainer.
>
> I am not an x86 maintainer at all; I was only giving an opinion.
>
> >
> > Do you want it changed in V2?
>
> Let's hear from x86 maintainers :)
>
> >
> > > >
> > > > Relative performance changes on Tigerlake:
> > > >
> > > > Time Unit: Ref Cycles
> > > > Size Unit: Bytes
> > > >
> > > > size, lat old, lat new, tput old, tput new
> > > > 0, 4.972, 5.054, 4.864, 4.870
> > >
> > > Really what matters in modern networking is the case for 40 bytes, and
> > > eventually 8 bytes.
> > >
> > > Can you add these two cases in this nice table ?
> > >
> >
> > Sure. With your suggestion there is an improvement in the 32-byte case
> > too.
> >
> > > We hardly have to checksum anything with NICs that are not decades old.
> > >
> > > Apparently making the 64-byte loop slightly longer incentivizes gcc to
> > > move it away (our intent with the unlikely() hint).
> > >
> > > Anyway I am thinking of providing a specialized inline version for
> > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
> > > time),
> > > so we will likely not use csum_partial() anymore.
> >
> > I see. For now, is it worth adding a case for len == 40 in this implementation?
> >
> > if (likely(len == 40)) {
> >         // optimized 40-byte + buff-aligned case
> > } else {
> >         // existing code
> > }
>
> This is what I had first in my tree to demonstrate what gains could be achieved.
>
> But I decided to keep csum_partial() as a generic function, yet optimize it a bit.
>
> My plan is to have a new inline helper that could be a mere
> csum_partial() on arches that do not have
> an optimized version.
>
> Preliminary patch to show the idea (this only compiles on x86; we still
> have to add the generic definition).
> The first call point is in the GRO stack.
> Note some of us in the US are not working until next Monday.
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -182,4 +182,25 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
> (__force unsigned)addend);
> }
>
> +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> +{
> +	u64 temp64;
> +
> +	if (unlikely(len != 40))
> +		return csum_partial(buff, len, sum);
> +
> +	temp64 = (__force u64)sum;
> +	asm("addq 0*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq $0,%[res]"
> +	    : [res] "+r" (temp64)
> +	    : [src] "r" (buff)
> +	    : "memory");
> +	return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
Makes sense. Although if you inline it, I think you will definitely want a more
conservative clobber than just "memory". I also think that with 40 bytes you
will get some value from two counters.

Did you see the number/question I posted about two accumulators for the
32-byte case? It's a judgement call about latency vs. throughput that I don't
really have an answer for.
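To illustrate the "two counters" idea for the 40-byte body, a rough sketch (hypothetical and untested; temp_accum is just a scratch u64, and whether the extra add pays off is that same judgement call):

```
/* Sketch: split the five quadwords 3/2 across two carry chains, then
 * merge.  Shorter chains let the adds overlap on separate ports. */
u64 temp_accum;

asm("movq 0*8(%[src]),%[res_tmp]\n\t"
    "addq 1*8(%[src]),%[res_tmp]\n\t"
    "adcq 2*8(%[src]),%[res_tmp]\n\t"
    "adcq $0,%[res_tmp]\n\t"
    "addq 3*8(%[src]),%[res]\n\t"
    "adcq 4*8(%[src]),%[res]\n\t"
    "adcq %[res_tmp],%[res]\n\t"
    "adcq $0,%[res]"
    : [res] "+r" (temp64), [res_tmp] "=&r" (temp_accum)
    : [src] "r" (buff)
    : "memory");
```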
> +}
> +#define ipv6_csum_partial ipv6_csum_partial
> +
> #endif /* _ASM_X86_CHECKSUM_64_H */
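A minimal sketch of the generic definition mentioned above, so arches without the asm helper fall back to csum_partial(); the placement (e.g. include/net/checksum.h) is hypothetical:

```
#ifndef ipv6_csum_partial
/* Generic fallback: no arch-specific helper, just use csum_partial(). */
static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
{
	return csum_partial(buff, len, sum);
}
#endif
```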
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
> NAPI_GRO_CB(skb)->is_atomic = true;
> NAPI_GRO_CB(skb)->flush |= flush;
>
> -	skb_gro_postpull_rcsum(skb, iph, nlen);
> +	if (NAPI_GRO_CB(skb)->csum_valid)
> +		NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen,
> +					~NAPI_GRO_CB(skb)->csum);
>
> pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
> ops->callbacks.gro_receive, head, skb);