[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFUsyfLoEckBrnYKUgqWC7AJPTBDfarjBOgBvtK7eGVZj9muYQ@mail.gmail.com>
Date: Sun, 28 Nov 2021 14:59:33 -0600
From: Noah Goldstein <goldstein.w.n@...il.com>
To: David Laight <David.Laight@...lab.com>
Cc: Eric Dumazet <edumazet@...gle.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
X86 ML <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"alexanderduyck@...com" <alexanderduyck@...com>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@...lab.com> wrote:
>
> ...
> > Regarding the 32 byte case, adding two accumulators helps with the latency
> > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > is the more important metric for the usage of csum_partial()?
> >
> > Here are the numbers for the smaller sizes:
> >
> > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
> > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
> > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
> > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
> > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
> > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
> > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
> > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
> > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
> > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
> > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
> > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
> > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
> > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
> > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
> > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
> > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
> > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
> >
> > 'ver2' uses two accumulators for 32 byte case and has better latency numbers
> > but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> > implementation
> > posted which has essentially the same numbers for tput/lat as 'old'
> > for sizes [0, 63].
>
> Which cpu are you testing on - it will make a big difference ?
Tigerlake, although assuming `adc` as the bottleneck, the results
should be largely independent.
> And what are you measing throughput in?
Running back to back iterations with the same input without any
dependency between iterations. The OoO window will include
multiple iterations at once.
> And are you testing aligned or mis-aligned 64bit reads?
Aligned as that is the common case.
>
> I think one of the performance counters will give 'cpu clocks'.
Time is in Ref Cycles using `rdtsc`
>
> I did some tests early last year and got 8 bytes/clock on broadwell/haswell
> with code that 'loop carried' the carry flag (a single adc chain).
> On the older Intel cpu (Ivy bridge onwards) 'adc' has a latency of 2
> for the result, but the carry flag is available earlier.
> So alternating the target register in the 'adc' chain will give (nearly)
> 8 bytes/clock - I think I got to 7.5.
>
> That can all be done with only 4 reads per interaction.
> IIRC broadwell/haswell only need 2 reads/iteration.
>
> It is actually likely (certainly worth checking) that haswell/broadwell
> can do two misaligned memory reads every clock.
> So it may not be worth aligning the reads (which the old code did).
> In any case aren't tx packets likely to be aligned, and rx ones
> misaligned to some known 4n+2 boundary?
>
> Using adxc/adxo together is a right PITA.
I'm a bit hesitant about adxc/adxo because they are extensions so
support will need to be tested.
> I did get (about) 12 bytes/clock fo long buffers while loop carrying
> both the overflow and carry flags.
>
> Also is there a copy of the patched code anywhere?
https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/
> I think I've missed some of the patches and they are difficult to follow.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Powered by blists - more mailing lists