Message-ID: <ca8dcc5b6fbf47b29d55a2ab9815c182@AcuMS.aculab.com>
Date: Thu, 2 Dec 2021 21:11:41 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Noah Goldstein' <goldstein.w.n@...il.com>,
Eric Dumazet <edumazet@...gle.com>
CC: "tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
X86 ML <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"alexanderduyck@...com" <alexanderduyck@...com>,
"open list" <linux-kernel@...r.kernel.org>,
netdev <netdev@...r.kernel.org>
Subject: RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in
csum_partial.c
From: Noah Goldstein
> Sent: 02 December 2021 20:19
>
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@...lab.com> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > various copies of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > This is reasonably stable once the branch predictor has settled down.
> > > So the difference in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
>
> Intuitively 10 passes is a bit low.
I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
The passes after (typically) the 5th or 6th tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.
Run the program 5 or 6 times with the same parameters and you sometimes
get a different (but still stable) value - probably something to do with
the physical pages backing the stack and data.
That was more obvious when I was timing a system call.
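For reference, the measurement idea is roughly the sketch below.
This is not the actual test program: the counter is assumed to be
already programmed to count core clocks (e.g. via perf_event_open(),
with rdpmc enabled for user space), and the rdpmc index used for the
fixed 'unhalted core cycles' counter is an assumption that needs
checking against the cpu.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Read performance counter selected by ecx into edx:eax. */
static inline uint64_t rdpmc(uint32_t counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return (uint64_t)hi << 32 | lo;
}

uint64_t csum64(const void *buf, size_t len);	/* loop under test */

void measure(const void *buf, size_t len)
{
	/* Assumed index: bit 30 selects the fixed counters, fixed
	 * counter 1 counts unhalted core cycles on Intel. */
	uint32_t ctr = (1u << 30) | 1;
	uint64_t clocks[10];
	int i;

	for (i = 0; i < 10; i++) {
		uint64_t start = rdpmc(ctr);

		csum64(buf, len);
		clocks[i] = rdpmc(ctr) - start;
	}
	/* Pass 0 is cold-cache; later passes should settle. */
	for (i = 0; i < 10; i++)
		printf("pass %d: %llu clocks\n", i,
			(unsigned long long)clocks[i]);
}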
> Also you might consider aligning
> the `csum64` function and possibly the loops.
It won't matter here; instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.
> Is there a reason you put `jrcxz` at the beginning of the loops instead
> of the end?
jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!
The 'paired' loop end instruction is 'loop' - decrement %cx and jump non-zero.
But that is 7+ cycles on current Intel cpus (it's ok on amd ones).
I can get a two clock loop with jrcxz and jmp - as in the examples.
But the timing is more stable with the loop taken out to 4 clocks.
You can't do a one clock loop :-(
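The shape is roughly this (a sketch, not the exact code from the
test program - the real loops carry more adds per iteration, and
len is assumed to be a non-zero multiple of 8):

/* jrcxz can only jump when %rcx is zero, so it has to be the
 * top-of-loop exit test; lea advances the (negative) index
 * without touching the carry flag the adc chain depends on. */
static uint64_t csum64_jrcxz(const void *buf, size_t len)
{
	uint64_t sum = 0;
	const char *end = (const char *)buf + len;
	long idx = -(long)len;

	asm(	"	clc\n"
		"1:	jrcxz 2f\n"
		"	adc (%[end],%[idx]), %[sum]\n"
		"	lea 8(%[idx]), %[idx]\n"
		"	jmp 1b\n"
		"2:	adc $0, %[sum]\n"	/* fold in the final carry */
		: [sum] "+r" (sum), [idx] "+c" (idx)
		: [end] "r" (end)
		: "memory");
	return sum;
}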
> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > is the adcx/adox one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > the dependent add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
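(On the adcx/adox point quoted above: the gain comes from running
two independent carry chains, one through CF and one through OF,
with flag-preserving loop control.  A sketch of the shape, under
the same assumptions as before and with the final carry folding
simplified - not the actual test code.  Needs the ADX extension,
Broadwell or later on Intel.)

static uint64_t csum64_adx(const void *buf, size_t len)
{
	uint64_t sum0 = 0, sum1 = 0;
	const char *end = (const char *)buf + len;
	long idx = -(long)len;		/* len a multiple of 16 */

	asm(	"	test %[idx], %[idx]\n"	/* clear CF and OF */
		"1:	jrcxz 2f\n"
		"	adcx (%[end],%[idx]), %[s0]\n"	/* chain 0 via CF */
		"	adox 8(%[end],%[idx]), %[s1]\n"	/* chain 1 via OF */
		"	lea 16(%[idx]), %[idx]\n"
		"	jmp 1b\n"
		"2:	adcx %[s1], %[s0]\n"	/* merge chains + CF */
		"	adox %[idx], %[s0]\n"	/* %[idx] is 0: fold in OF */
		"	adc $0, %[s0]\n"
		: [s0] "+r" (sum0), [s1] "+r" (sum1), [idx] "+c" (idx)
		: [end] "r" (end)
		: "memory");
	return sum0;
}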
>
> Maybe add an `lfence` before / after `csum64`
That's probably less strong than rdtsc, but I might try it.
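Something like this, assuming the rdpmc() helper sketched above
(lfence on Intel waits for all prior instructions to complete
before later ones execute, which should stop the adc chain still
draining when the second counter read happens):

static inline void lfence(void)
{
	asm volatile("lfence" ::: "memory");
}

	...
	lfence();
	start = rdpmc(ctr);
	csum64(buf, len);
	lfence();		/* wait for the adds to complete */
	clocks[i] = rdpmc(ctr) - start;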
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)