Message-ID: <31bd81df79c4488c92c6a149eeceee3c@AcuMS.aculab.com>
Date: Sun, 14 Nov 2021 19:09:54 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Eric Dumazet' <edumazet@...gle.com>
CC: Alexander Duyck <alexander.duyck@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
"Jakub Kicinski" <kuba@...nel.org>,
netdev <netdev@...r.kernel.org>,
"the arch/x86 maintainers" <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: RE: [PATCH v1] x86/csum: rewrite csum_partial()
From: Eric Dumazet
> Sent: 14 November 2021 15:04
>
> On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@...lab.com> wrote:
> >
> > From: Eric Dumazet
> > > Sent: 11 November 2021 22:31
> > ..
> > > That requires an extra add32_with_carry(), which unfortunately made
> > > the thing slower for me.
> > >
> > > I even hardcoded an inline fast_csum_40bytes() and got best results
> > > with the 10+1 addl,
> > > instead of
> > > (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl
> >
> > Did you try something like:
> > sum = buf[0];
> > val = buf[1]:
> > asm(
> > add64 sum, val
> > adc64 sum, buf[2]
> > adc64 sum, buf[3]
> > adc64 sum, buf[4]
> > adc64 sum, 0
> > }
> > sum_hi = sum >> 32;
> > asm(
> > add32 sum, sum_hi
> > adc32 sum, 0
> > )
>
> This is what I tried. but the last part was using add32_with_carry(),
> and clang was adding stupid mov to temp variable on the stack,
> killing the perf.
Persuading the compiler to generate the required assembler is an art!
I also ended up using __builtin_bswap32(sum) when the alignment
was 'odd' - the shift expression didn't always get converted
to a rotate. Byteswap32 DTRT.
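A minimal sketch of why either form works for the 'odd' alignment fixup, assuming a 32bit partial sum that is later folded to 16 bits. The helper names (fold32, odd_fix_rotate, odd_fix_bswap) are made up for illustration and are not the kernel's; only __builtin_bswap32() is a real gcc/clang builtin.

```c
#include <stdint.h>

/* Fold a 32-bit partial sum to 16 bits with end-around carry. */
static uint16_t fold32(uint32_t s)
{
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Odd-alignment fixup written as a shift expression (rotate left by 8).
 * The compiler may or may not turn this into a single rotate. */
static uint32_t odd_fix_rotate(uint32_t s)
{
    return (s << 8) | (s >> 24);
}

/* The same fixup using the byte-swap builtin. */
static uint32_t odd_fix_bswap(uint32_t s)
{
    return __builtin_bswap32(s);
}

/* Swap the two bytes of a folded 16-bit checksum. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Both fixups give the same 16bit result after folding, and that result is the byte-swapped fold of the original sum, so the choice between them only affects the generated code.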
I also noticed that any initial checksum was being added in at the end.
The 64bit code can almost always handle a 32 bit (or maybe 56bit!)
input value and add it in 'for free' into the code that does the
initial alignment.
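A rough sketch of the idea, assuming a buffer that is a multiple of 8 bytes and ignoring the alignment handling entirely. All names here (read_le64, adc64, fold_to_16, csum_seed_first, csum_seed_last) are hypothetical. It only shows that starting the 64bit accumulator from the 32bit input value gives the same folded result as adding it in at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble 8 bytes as a little-endian 64-bit word (portable stand-in
 * for a plain 64-bit load on x86). */
static uint64_t read_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* 64-bit add with end-around carry, like add + adc $0. */
static uint64_t adc64(uint64_t a, uint64_t b)
{
    a += b;
    return a + (a < b);
}

/* Fold a 64-bit sum down to 16 bits. */
static uint16_t fold_to_16(uint64_t s)
{
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Input checksum seeds the accumulator: no extra add at the end. */
static uint16_t csum_seed_first(const unsigned char *buf, size_t len,
                                uint32_t init)
{
    uint64_t sum = init;    /* a 32-bit seed fits in the 64-bit accumulator */
    for (size_t i = 0; i + 8 <= len; i += 8)
        sum = adc64(sum, read_le64(buf + i));
    return fold_to_16(sum);
}

/* Input checksum added after the loop, for comparison. */
static uint16_t csum_seed_last(const unsigned char *buf, size_t len,
                               uint32_t init)
{
    uint64_t sum = 0;
    for (size_t i = 0; i + 8 <= len; i += 8)
        sum = adc64(sum, read_le64(buf + i));
    return fold_to_16(adc64(sum, init));
}
```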
I don't remember testing misaligned buffers.
But I think it doesn't matter (on any cpu anyone cares about!).
Even Sandy Bridge can do two memory reads in one clock.
So it should be able to do a single misaligned read every clock.
Which almost certainly means that aligning the addresses is pointless.
(Given you're not trying to do the adcx/adox loop.)
(Page spanning shouldn't matter.)
For buffers that aren't a multiple of 8 bytes it might be best to
read the last 8 bytes first and shift left to discard the ones that
would get added in twice.
This value can be added to the 32bit 'input' checksum.
Something like:
sum_in += buf[length - 8] << (64 - (length & 7) * 8);
Annoyingly a special case is needed for buffers shorter than 8 bytes
to avoid falling off the start of a page.
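A slightly fuller sketch of the tail handling, under stated assumptions: the buffer is at least 8 bytes long (so the short-buffer special case is not covered), and words are assembled in little-endian order. In that byte order the bytes already covered by the preceding whole word are the low-order ones, so this sketch discards them with a right shift; that direction is an assumption of the sketch. All function names are hypothetical, and csum_bytewise() exists only as a reference to check against.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble 8 bytes as a little-endian 64-bit word. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* 64-bit add with end-around carry. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
    a += b;
    return a + (a < b);
}

/* Fold a 64-bit sum down to 16 bits. */
static uint16_t fold64(uint64_t s)
{
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Sum whole 8-byte words, then pick up the 1..7 trailing bytes by
 * re-reading the last 8 bytes and shifting out the overlap.
 * Requires len >= 8. */
static uint16_t csum_tail_trick(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    size_t tail = len & 7;

    for (size_t i = 0; i + 8 <= len; i += 8)
        sum = add64_carry(sum, load_le64(buf + i));

    if (tail) {     /* skipped when tail == 0, avoiding a shift by 64 */
        uint64_t last = load_le64(buf + len - 8);
        last >>= 64 - tail * 8;     /* drop bytes already summed */
        sum = add64_carry(sum, last);
    }
    return fold64(sum);
}

/* Reference: add each byte into its 16-bit lane, one at a time. */
static uint16_t csum_bytewise(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (uint64_t)buf[i] << ((i & 1) * 8);
    return fold64(sum);
}
```

The trailing bytes start at a multiple of 8, so after the shift they sit in the same 16bit lanes they would occupy in a zero-padded final word.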
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)