Message-ID: <ca8dcc5b6fbf47b29d55a2ab9815c182@AcuMS.aculab.com>
Date: Thu, 2 Dec 2021 21:11:41 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Noah Goldstein' <goldstein.w.n@...il.com>,
Eric Dumazet <edumazet@...gle.com>
CC: "tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
X86 ML <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"alexanderduyck@...com" <alexanderduyck@...com>,
"open list" <linux-kernel@...r.kernel.org>,
netdev <netdev@...r.kernel.org>
Subject: RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in
csum_partial.c
From: Noah Goldstein
> Sent: 02 December 2021 20:19
>
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@...lab.com> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > various copies of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > This is reasonably stable once the branch predictor has settled down.
> > > So the difference in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
>
> Intuitively 10 passes is a bit low.
I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
The passes after (typically) the 5th or 6th tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.
Run the program 5 or 6 times with the same parameters and you sometimes
get a different (but still stable) value - probably something to do with
the physical pages backing the stack and data.
That was more obvious when I was timing a system call.
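For reference, the measurement idea is roughly the sketch below.
This is not the actual test program: the counter is assumed to be
already programmed to count core clocks (e.g. via perf_event_open(),
with rdpmc enabled for user space), and the rdpmc index used for the
fixed 'unhalted core cycles' counter is an assumption that needs
checking against the cpu.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Read performance counter selected by ecx into edx:eax. */
static inline uint64_t rdpmc(uint32_t counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return (uint64_t)hi << 32 | lo;
}

uint64_t csum64(const void *buf, size_t len);	/* loop under test */

void measure(const void *buf, size_t len)
{
	/* Assumed index: bit 30 selects the fixed counters, fixed
	 * counter 1 counts unhalted core cycles on Intel. */
	uint32_t ctr = (1u << 30) | 1;
	uint64_t clocks[10];
	int i;

	for (i = 0; i < 10; i++) {
		uint64_t start = rdpmc(ctr);

		csum64(buf, len);
		clocks[i] = rdpmc(ctr) - start;
	}
	/* Pass 0 is cold-cache; later passes should settle. */
	for (i = 0; i < 10; i++)
		printf("pass %d: %llu clocks\n", i,
			(unsigned long long)clocks[i]);
}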
> Also you might consider aligning
> the `csum64` function and possibly the loops.
It won't matter here; instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.
> Is there a reason you put `jrcxz` at the beginning of the loops instead
> of the end?
jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!
The 'paired' loop end instruction is 'loop' - decrement %cx and jump non-zero.
But that is 7+ cycles on current Intel cpus (it's ok on amd ones).
I can get a two clock loop with jrcxz and jmp - as in the examples.
But the timing is more stable with the loop taken out to 4 clocks.
You can't do a one clock loop :-(
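The shape is roughly this (a sketch, not the exact code from the
test program - the real loops carry more adds per iteration, and
len is assumed to be a non-zero multiple of 8):

/* jrcxz can only jump when %rcx is zero, so it has to be the
 * top-of-loop exit test; lea advances the (negative) index
 * without touching the carry flag the adc chain depends on. */
static uint64_t csum64_jrcxz(const void *buf, size_t len)
{
	uint64_t sum = 0;
	const char *end = (const char *)buf + len;
	long idx = -(long)len;

	asm(	"	clc\n"
		"1:	jrcxz 2f\n"
		"	adc (%[end],%[idx]), %[sum]\n"
		"	lea 8(%[idx]), %[idx]\n"
		"	jmp 1b\n"
		"2:	adc $0, %[sum]\n"	/* fold in the final carry */
		: [sum] "+r" (sum), [idx] "+c" (idx)
		: [end] "r" (end)
		: "memory");
	return sum;
}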
> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > is the adcx/adox one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > the dependent add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
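(On the adcx/adox point quoted above: the gain comes from running
two independent carry chains, one through CF and one through OF,
with flag-preserving loop control.  A sketch of the shape, under
the same assumptions as before and with the final carry folding
simplified - not the actual test code.  Needs the ADX extension,
Broadwell or later on Intel.)

static uint64_t csum64_adx(const void *buf, size_t len)
{
	uint64_t sum0 = 0, sum1 = 0;
	const char *end = (const char *)buf + len;
	long idx = -(long)len;		/* len a multiple of 16 */

	asm(	"	test %[idx], %[idx]\n"	/* clear CF and OF */
		"1:	jrcxz 2f\n"
		"	adcx (%[end],%[idx]), %[s0]\n"	/* chain 0 via CF */
		"	adox 8(%[end],%[idx]), %[s1]\n"	/* chain 1 via OF */
		"	lea 16(%[idx]), %[idx]\n"
		"	jmp 1b\n"
		"2:	adcx %[s1], %[s0]\n"	/* merge chains + CF */
		"	adox %[idx], %[s0]\n"	/* %[idx] is 0: fold in OF */
		"	adc $0, %[s0]\n"
		: [s0] "+r" (sum0), [s1] "+r" (sum1), [idx] "+c" (idx)
		: [end] "r" (end)
		: "memory");
	return sum0;
}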
>
> Maybe add an `lfence` before / after `csum64`
That's probably less strong than rdtsc, but I might try it.
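Something like this, assuming the rdpmc() helper sketched above
(lfence on Intel waits for all prior instructions to complete
before later ones execute, which should stop the adc chain still
draining when the second counter read happens):

static inline void lfence(void)
{
	asm volatile("lfence" ::: "memory");
}

	...
	lfence();
	start = rdpmc(ctr);
	csum64(buf, len);
	lfence();		/* wait for the adds to complete */
	clocks[i] = rdpmc(ctr) - start;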
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)