netdev - Re: [PATCH v1] x86/csum: rewrite csum


Message-ID: <CANn89iLJe-M65NGbbj=wYjCKSz35yg3rtWoi4Lh7DMNcXmKNZg@mail.gmail.com>
Date:   Sun, 14 Nov 2021 11:23:46 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     Alexander Duyck <alexander.duyck@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        netdev <netdev@...r.kernel.org>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v1] x86/csum: rewrite csum_partial()

On Sun, Nov 14, 2021 at 11:10 AM David Laight <David.Laight@...lab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 14 November 2021 15:04
> >
> > On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@...lab.com> wrote:
> > >
> > > From: Eric Dumazet
> > > > Sent: 11 November 2021 22:31
> > > ..
> > > > That requires an extra add32_with_carry(), which unfortunately made
> > > > the thing slower for me.
> > > >
> > > > I even hardcoded an inline fast_csum_40bytes() and got best results
> > > > with the 10+1 addl,
> > > > instead of
> > > >  (5 + 1) adcq + mov (needing one extra register) + shift + addl + adcl
> > >
> > > Did you try something like:
> > >         sum = buf[0];
> > >         val = buf[1];
> > >         asm(
> > >                 add64 sum, val
> > >                 adc64 sum, buf[2]
> > >                 adc64 sum, buf[3]
> > >                 adc64 sum, buf[4]
> > >                 adc64 sum, 0
> > >         )
> > >         sum_hi = sum >> 32;
> > >         asm(
> > >                 add32 sum, sum_hi
> > >                 adc32 sum, 0
> > >         )
> >
> > This is what I tried, but the last part was using add32_with_carry(),
> > and clang was adding a stupid mov to a temp variable on the stack,
> > killing the perf.
>
> Persuading the compiler to generate the required assembler is an art!
>
> I also ended up using __builtin_bswap32(sum) when the alignment
> was 'odd' - the shift expression didn't always get converted
> to a rotate. Byteswap32 DTRT.
>
> I also noticed that any initial checksum was being added in at the end.
> The 64bit code can almost always handle a 32 bit (or maybe 56bit!)
> input value and add it in 'for free' into the code that does the
> initial alignment.

This has been fixed in V2: the initial csum is used instead of 0.
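
For readers following along, here is a minimal portable C sketch of the
40-byte case with the initial csum seeded into the accumulator rather
than added in at the end. It is not the V2 patch: the names
add64_with_carry(), fold64_to_32() and csum_40bytes_sketch() are
hypothetical, and plain C carry detection stands in for the adc
instructions discussed above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry. */
static inline uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	/* s < b only if the addition wrapped; add that carry back in. */
	return s + (s < b);
}

/* Hypothetical helper: fold a 64-bit sum to 32 bits, keeping the carry. */
static inline uint32_t fold64_to_32(uint64_t sum)
{
	uint32_t lo = (uint32_t)sum;
	uint32_t hi = (uint32_t)(sum >> 32);
	uint32_t r = lo + hi;

	return r + (r < hi);
}

/* Sum five 64-bit words (40 bytes), starting from the caller's csum. */
uint32_t csum_40bytes_sketch(const void *buf, uint32_t initial)
{
	uint64_t w[5];
	uint64_t sum = initial;	/* seed with the initial csum, not 0 */
	int i;

	memcpy(w, buf, sizeof(w));	/* tolerates a misaligned buf */
	for (i = 0; i < 5; i++)
		sum = add64_with_carry(sum, w[i]);

	return fold64_to_32(sum);
}
```

Seeding the accumulator is free here because a 32-bit value cannot
overflow the 64-bit sum on its own, so no separate add is needed at the
end.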

>
> I don't remember testing misaligned buffers.
> But I think it doesn't matter (on any cpu anyone cares about!).
> Even Sandy Bridge can do two memory reads in one clock.
> So it should be able to do a single misaligned read every clock.
> Which almost certainly means that aligning the addresses is pointless.
> (Given you're not trying to do the adcx/adox loop.)
> (Page spanning shouldn't matter.)
>
> For buffers that aren't a multiple of 8 bytes it might be best to
> read the last 8 bytes first and shift left to discard the ones that
> would get added in twice.
> This value can be added to the 32bit 'input' checksum.
> Something like:
>         sum_in += buf[length - 8] << (64 - (length & 7) * 8);
> Annoyingly a special case is needed for buffers shorter than 8 bytes
> to avoid falling off the start of a page.

Yep, Alexander/Peter proposed this already, and it is implemented in V2.
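
A rough sketch of the "read the last 8 bytes" idea, in portable C. This
is an illustration under stated assumptions, not the V2 code: the
function names are hypothetical, and a little-endian machine is
assumed. On little-endian the bytes already covered by the main loop
land in the low-order end of the loaded word, so this sketch discards
them with a right shift. Buffers shorter than 8 bytes take a separate
path so that nothing before the start of the buffer is read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry. */
static inline uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);
}

/*
 * Hypothetical helper: trailing (len & 7) bytes taken from the last
 * 8 bytes of the buffer. Assumes little-endian and len >= 8.
 */
static inline uint64_t tail_from_last8(const unsigned char *buf, size_t len)
{
	unsigned int k = len & 7;
	uint64_t last;

	if (k == 0)
		return 0;
	memcpy(&last, buf + len - 8, sizeof(last));
	/* Drop the 8 - k bytes the 8-byte loop has already summed. */
	return last >> (64 - k * 8);
}

uint32_t csum_tail_sketch(const void *p, size_t len, uint32_t initial)
{
	const unsigned char *buf = p;
	uint64_t sum = initial;
	uint64_t w = 0;
	uint32_t lo, hi, r;
	size_t i;

	if (len < 8) {
		/* Special case: never read before the start of buf. */
		memcpy(&w, buf, len);
		sum = add64_with_carry(sum, w);
	} else {
		for (i = 0; i + 8 <= len; i += 8) {
			memcpy(&w, buf + i, sizeof(w));
			sum = add64_with_carry(sum, w);
		}
		sum = add64_with_carry(sum, tail_from_last8(buf, len));
	}

	/* Fold 64 -> 32 bits with end-around carry. */
	lo = (uint32_t)sum;
	hi = (uint32_t)(sum >> 32);
	r = lo + hi;
	return r + (r < hi);
}
```

The tail always begins at a multiple of 8 bytes from the start of the
buffer, so after the shift its first byte sits in the low byte of the
word and the 16-bit lanes stay aligned with the rest of the sum.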

>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)