netdev - RE: [PATCH v1] x86/csum: rewrite csum


Message-ID: <31bd81df79c4488c92c6a149eeceee3c@AcuMS.aculab.com>
Date:   Sun, 14 Nov 2021 19:09:54 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Eric Dumazet' <edumazet@...gle.com>
CC:     Alexander Duyck <alexander.duyck@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        "Jakub Kicinski" <kuba@...nel.org>,
        netdev <netdev@...r.kernel.org>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: RE: [PATCH v1] x86/csum: rewrite csum_partial()

From: Eric Dumazet
> Sent: 14 November 2021 15:04
> 
> On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@...lab.com> wrote:
> >
> > From: Eric Dumazet
> > > Sent: 11 November 2021 22:31
> > ..
> > > That requires an extra add32_with_carry(), which unfortunately made
> > > the thing slower for me.
> > >
> > > I even hardcoded an inline fast_csum_40bytes() and got best results
> > > with the 10+1 addl,
> > > instead of
> > >  (5 + 1) adcq +  mov (needing one extra  register) + shift + addl + adcl
> >
> > Did you try something like:
> >         sum = buf[0];
> >         val = buf[1];
> >         asm(
> >                 add64 sum, val
> >                 adc64 sum, buf[2]
> >                 adc64 sum, buf[3]
> >                 adc64 sum, buf[4]
> >                 adc64 sum, 0
> >         )
> >         sum_hi = sum >> 32;
> >         asm(
> >                 add32 sum, sum_hi
> >                 adc32 sum, 0
> >         )
> 
> This is what I tried, but the last part was using add32_with_carry(),
> and clang was adding stupid mov to temp variable on the stack,
> killing the perf.

Persuading the compiler to generate the required assembler is an art!
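
For reference, the add64/adc64 sequence quoted above can be written in
portable C. This is only a sketch: the names add64_with_carry(),
fold64_to_32() and csum_40bytes_sketch() are made up for illustration,
and whether a given compiler turns it into a clean adc chain is exactly
the open question in this thread.

```c
#include <stdint.h>
#include <string.h>

/* Ones' complement add: a + b with the carry-out added back in. */
static uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < a);	/* s wrapped below a exactly when a carry occurred */
}

/* Fold a 64-bit sum to 32 bits, mirroring the add32/adc32 pair. */
static uint32_t fold64_to_32(uint64_t sum)
{
	uint32_t lo = (uint32_t)sum;
	uint32_t s = lo + (uint32_t)(sum >> 32);

	return s + (s < lo);
}

/* Hypothetical helper: 32-bit partial checksum of exactly 40 bytes. */
static uint32_t csum_40bytes_sketch(const void *buf)
{
	uint64_t w[5];
	uint64_t sum;
	int i;

	memcpy(w, buf, sizeof(w));	/* alignment-safe 64-bit loads */
	sum = w[0];
	for (i = 1; i < 5; i++)
		sum = add64_with_carry(sum, w[i]);
	return fold64_to_32(sum);
}
```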

I also ended up using __builtin_bswap32(sum) when the alignment
was 'odd' - the shift expression didn't always get converted
to a rotate. Byteswap32 DTRT.
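
A small sketch of why the byteswap works. The shift form below is an
assumption about what such an expression might look like (a rotate by
8 bits); __builtin_bswap32() is the GCC/Clang builtin. The two give
different 32-bit values, but both fold to the same 16-bit result, which
is the byte-swapped fold of the original sum.

```c
#include <stdint.h>

/* Fold a 32-bit ones' complement sum down to 16 bits. */
static uint16_t fold32_to_16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Shift form: a rotate left by 8, if the compiler recognises it as one. */
static uint32_t swap_lanes_shift(uint32_t sum)
{
	return (sum << 8) | (sum >> 24);
}

/* Builtin form: a full 32-bit byte reversal. */
static uint32_t swap_lanes_bswap(uint32_t sum)
{
	return __builtin_bswap32(sum);
}
```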

I also noticed that any initial checksum was being added in at the end.
The 64-bit code can almost always handle a 32-bit (or maybe 56-bit!)
input value and add it in 'for free' in the code that does the
initial alignment.

I don't remember testing misaligned buffers.
But I think it doesn't matter (on any cpu anyone cares about!).
Even Sandy Bridge can do two memory reads in one clock.
So it should be able to do a single misaligned read every clock.
Which almost certainly means that aligning the addresses is pointless.
(Given you're not trying to do the adcx/adox loop.)
(Page spanning shouldn't matter.)

For buffers that aren't a multiple of 8 bytes it might be best to
read the last 8 bytes first and shift left to discard the ones that
would get added in twice.
This value can be added to the 32bit 'input' checksum.
Something like:
	sum_in += buf[length - 8] << (64 - (length & 7) * 8);
Annoyingly a special case is needed for buffers shorter than 8 bytes
to avoid falling off the start of a page.
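
A rough, untuned C sketch of that idea, with some assumptions spelled
out. The names (load64(), csum_tail_first_sketch(), fold64_to_16()) are
made up, it assumes a little-endian cpu, and it only handles
length >= 8. On a little-endian load the bytes already covered by the
8-byte loop sit in the low-order end of the word, so this sketch shifts
right to drop them. That direction is an assumption about the intended
byte order, not something stated above. The tail is at most 56 bits, so
adding it to a 32-bit input checksum cannot overflow 64 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned 64-bit load; compilers normally emit a single mov for this. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < a);
}

/* Hypothetical helper: little-endian only, requires len >= 8. */
static uint64_t csum_tail_first_sketch(const unsigned char *buf, size_t len,
				       uint32_t sum_in)
{
	uint64_t sum = sum_in;
	size_t rem = len & 7;
	size_t i;

	if (rem) {
		/* Last 8 bytes; drop the (8 - rem) bytes the loop also covers. */
		uint64_t tail = load64(buf + len - 8) >> (64 - rem * 8);

		sum += tail;	/* < 2^32 + 2^56, so no carry is possible here */
	}
	for (i = 0; i + 8 <= len; i += 8)
		sum = add64_with_carry(sum, load64(buf + i));
	return sum;
}

/* Fold the 64-bit result down to a 16-bit ones' complement sum. */
static uint16_t fold64_to_16(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The short-buffer case (length < 8) is deliberately left out, for the
reason given above.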

	David
