Message-ID: <e08af965e5b4422e9b38d8ccd90f8e7b@AcuMS.aculab.com>
Date:   Mon, 15 Nov 2021 10:23:31 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Eric Dumazet' <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>
CC:     netdev <netdev@...r.kernel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        "x86@...nel.org" <x86@...nel.org>,
        Alexander Duyck <alexander.duyck@...il.com>
Subject: RE: [RFC] x86/csum: rewrite csum_partial()

From: David Laight
> Sent: 14 November 2021 14:12
> ..
> > If you aren't worried (too much) about cpu before Broadwell then IIRC
> > this loop gets close to 8 bytes/clock:
> >
> > +               "10:    jecxz 20f\n"
> > +               "       adc   (%[buff], %[len]), %[sum]\n"
> > +               "       adc   8(%[buff], %[len]), %[sum]\n"
> > +               "       lea   16(%[len]), %[tmp]\n"
> > +               "       jmp   10b\n"
> > +               " 20:"
> 
> It is even possible a loop based on:
> 	10:	adc	(%[buff], %[len], 8), %[sum]
> 		inc	%[len]
> 		jnz	10b
> will run at 8 bytes per clock on very recent Intel cpu.

It doesn't on an i7-7700 (which I probably tested last year).

But the first ('jecxz') loop does run twice as fast, and will only
be beaten by an adcx/adox loop.
So there is no need to unroll to more than 2 reads per loop iteration.

For CPUs between Ivy Bridge and Broadwell you want to use
separate 'sum' registers to avoid the 2 clock latency
of the adc result.
That should beat the 4 bytes/clock of the current loop,
but it does need an extra unroll to get near 8 bytes/clock.
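
To illustrate the 'separate sum registers' idea, here is a rough portable C
sketch. It is not the kernel code: the names csum_one_chain(),
csum_two_chains(), add64_carry() and fold64() are made up for this example,
and it assumes the length is a multiple of 16 bytes. Whether the compiler
really keeps two independent adc chains is also an assumption; the point is
only that the two accumulators have no data dependency on each other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit one's-complement add: wrap the carry-out back into bit 0. */
static inline uint64_t add64_carry(uint64_t sum, uint64_t val)
{
    sum += val;
    return sum + (sum < val);   /* (sum < val) is the carry-out */
}

/* Fold a 64-bit one's-complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Single dependency chain: every add waits for the previous result. */
static uint64_t csum_one_chain(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0, w;

    for (size_t i = 0; i + 8 <= len; i += 8) {
        memcpy(&w, buf + i, 8);
        sum = add64_carry(sum, w);
    }
    return sum;
}

/* Two independent chains (hypothetical helper): len must be a multiple of 16. */
static uint64_t csum_two_chains(const unsigned char *buf, size_t len)
{
    uint64_t sum_a = 0, sum_b = 0, w[2];

    for (size_t i = 0; i + 16 <= len; i += 16) {
        memcpy(w, buf + i, 16);
        sum_a = add64_carry(sum_a, w[0]);   /* chain A */
        sum_b = add64_carry(sum_b, w[1]);   /* chain B, independent of A */
    }
    return add64_carry(sum_a, sum_b);       /* merge the chains once at the end */
}
```

Because one's-complement addition is associative and commutative, both
functions fold to the same 16-bit checksum; only the length of the
dependency chain differs.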

For older CPUs (Nehalem/Core 2) the 'jecxz' loop is about the
only way to 'loop carry' the carry flag without the
6 clock penalty for the partial flags register update.
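
For reference, a self-contained sketch of how such a loop might be wrapped
for x86-64 with GCC/clang inline asm. This is an illustration only, not the
proposed patch: csum_jrcxz() and csum_ref() are hypothetical names, the
64-bit 'jrcxz' form is used instead of 'jecxz', the index is a negative
offset from the end of the buffer pinned to %rcx, and the length is assumed
to be a multiple of 16. Neither lea, jrcxz nor jmp touches the flags, so the
carry survives from one iteration to the next.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable reference: 64-bit one's-complement sum with end-around carry. */
static uint64_t csum_ref(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0, w;

    for (size_t i = 0; i + 8 <= len; i += 8) {
        memcpy(&w, buf + i, 8);
        sum += w;
        sum += (sum < w);       /* add the carry-out back in */
    }
    return sum;
}

/* Hypothetical wrapper; len must be a multiple of 16. */
static uint64_t csum_jrcxz(const unsigned char *buf, size_t len)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    const unsigned char *end = buf + len;
    intptr_t idx = -(intptr_t)len;  /* counts up to zero, lives in %rcx */
    uint64_t sum = 0;

    __asm__ volatile(
        "    clc\n"                                 /* start with carry clear */
        "10: jrcxz 20f\n"                           /* exit when idx == 0, flags untouched */
        "    adcq  (%[end], %[idx]), %[sum]\n"
        "    adcq  8(%[end], %[idx]), %[sum]\n"
        "    leaq  16(%[idx]), %[idx]\n"            /* advance without changing flags */
        "    jmp   10b\n"
        "20: adcq  $0, %[sum]\n"                    /* fold in the final carry */
        : [sum] "+r"(sum), [idx] "+c"(idx)
        : [end] "r"(end)
        : "memory", "cc");
    return sum;
#else
    /* Not x86-64: fall back to the portable loop. */
    return csum_ref(buf, len);
#endif
}
```

The two results can differ between all-zeros and all-ones (both represent
zero in one's-complement arithmetic), so any comparison should be done
modulo 2^64 - 1, or on the folded 16-bit value modulo 0xffff.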

	David
