Message-ID: <e08af965e5b4422e9b38d8ccd90f8e7b@AcuMS.aculab.com>
Date: Mon, 15 Nov 2021 10:23:31 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Eric Dumazet' <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>
CC: netdev <netdev@...r.kernel.org>,
Eric Dumazet <edumazet@...gle.com>,
"x86@...nel.org" <x86@...nel.org>,
Alexander Duyck <alexander.duyck@...il.com>
Subject: RE: [RFC] x86/csum: rewrite csum_partial()
From: David Laight
> Sent: 14 November 2021 14:12
> ..
> > If you aren't worried (too much) about cpu before Broadwell then IIRC
> > this loop gets close to 8 bytes/clock:
> >
> > + "10: jecxz 20f\n"
> > + " adc (%[buff], %[len]), %[sum]\n"
> > + " adc 8(%[buff], %[len]), %[sum]\n"
> > + " lea 16(%[len]), %[len]\n"
> > + " jmp 10b\n"
> > + " 20:"
>
> It is even possible a loop based on:
> 10: adc (%[buff], %[len], 8), %[sum]
>     inc %[len]
>     jnz 10b
> will run at 8 bytes per clock on very recent Intel cpu.
It doesn't on an i7-7700 (which I probably tested last year).
But the first loop does run twice as fast, and will only
be beaten by the adcx/adox loop.
So there is no need to unroll to more than 2 reads/loop.

For CPUs between Ivy Bridge and Broadwell you want to use
separate 'sum' registers to avoid the 2-clock latency
of the adc result.
That should beat the 4 bytes/clock of the current loop,
but it does need an extra unroll to get near 8 bytes/clock.
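As a rough illustration of the 'separate sum registers' idea, here is a portable C sketch rather than the asm that would actually go into csum_partial(). The names add64_carry(), fold64(), csum_two_sums() and csum_ref16() are made up for this example, and the buffer length is assumed to be a multiple of 16. Whether a compiler turns this into adc instructions is not guaranteed; the point is only that the two accumulators form independent dependency chains.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: the C equivalent of add followed by adc $0. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return s + (s < a);          /* s < a means the add wrapped */
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: len is assumed to be a multiple of 16.
 * sum0 and sum1 never depend on each other inside the loop, so the
 * latency of one add chain does not stall the other. */
static uint16_t csum_two_sums(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum0 = 0, sum1 = 0;

    for (size_t i = 0; i < len; i += 16) {
        uint64_t w0, w1;
        memcpy(&w0, p + i, 8);
        memcpy(&w1, p + i + 8, 8);
        sum0 = add64_carry(sum0, w0);
        sum1 = add64_carry(sum1, w1);
    }
    /* Combine the two partial sums only once, after the loop. */
    return fold64(add64_carry(sum0, sum1));
}

/* Simple 16-bit-at-a-time reference, used only as a cross-check. */
static uint16_t csum_ref16(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i += 2) {
        uint16_t w;
        memcpy(&w, p + i, 2);
        sum += w;
    }
    return fold64(sum);
}
```

Both functions read the data in native byte order, so for any non-empty buffer whose length is a multiple of 16, csum_two_sums() and csum_ref16() should agree.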

For older CPUs (Nehalem/Core 2) the 'jecxz' loop is about the
only way to 'loop carry' the carry flag without the
6-clock penalty for the partial flags register update.
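To make the 'loop carry' point concrete, below is a minimal sketch of such a loop wrapped in a C function. It is not the proposed patch: csum_carry_loop() and fold64() are names invented for this example, the length is assumed to be a multiple of 16 and below 4GB, and it uses jrcxz (the 64-bit counterpart of jecxz) so that the whole index register is tested. The loop control consists only of jrcxz, lea and jmp, none of which write the flags, so the carry from the last adc survives into the next iteration. On non-x86-64 builds a plain C fallback is used instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: 64-bit ones' complement sum of buf.
 * Assumes len is a multiple of 16. */
static uint64_t csum_carry_loop(const void *buf, size_t len)
{
    uint64_t sum = 0;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    const unsigned char *end = (const unsigned char *)buf + len;
    int64_t idx = -(int64_t)len;     /* negative index, counts up to zero */

    __asm__ volatile(
        "    clc\n"                             /* start with carry clear */
        "10: jrcxz 20f\n"                       /* exit when idx == 0, flags untouched */
        "    adcq (%[end], %[idx]), %[sum]\n"
        "    adcq 8(%[end], %[idx]), %[sum]\n"
        "    lea 16(%[idx]), %[idx]\n"          /* advance without writing flags */
        "    jmp 10b\n"
        "20: adcq $0, %[sum]\n"                 /* fold in the final carry */
        : [sum] "+r"(sum), [idx] "+c"(idx)
        : [end] "r"(end)
        : "memory", "cc");
#else
    /* Portable fallback: add each word with an end-around carry. */
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);
        sum += w;
        sum += (sum < w);
    }
#endif
    return sum;
}
```

A caller would typically reduce the result with fold64(csum_carry_loop(buf, len)). The index has to live in rcx because that is the only register jrcxz can test, which is why the operand uses the "c" constraint.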
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)