Message-ID: <CAFUsyfLoEckBrnYKUgqWC7AJPTBDfarjBOgBvtK7eGVZj9muYQ@mail.gmail.com>
Date:   Sun, 28 Nov 2021 14:59:33 -0600
From:   Noah Goldstein <goldstein.w.n@...il.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        X86 ML <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "alexanderduyck@...com" <alexanderduyck@...com>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@...lab.com> wrote:
>
> ...
> > Regarding the 32-byte case, adding two accumulators helps with the latency
> > numbers but causes a regression in throughput for the 40/48-byte cases. Which
> > is the more important metric for the usage of csum_partial()?
> >
> > Here are the numbers for the smaller sizes:
> >
> > size, lat old,    lat ver2,    lat ver1,    tput old,   tput ver2,   tput ver1
> >    0,   4.961,       4.503,       4.901,       4.887,       4.399,       4.951
> >    8,   5.590,       5.594,       5.620,       4.227,       4.110,       4.252
> >   16,   6.182,       6.398,       6.202,       4.233,       4.062,       4.278
> >   24,   7.392,       7.591,       7.380,       4.256,       4.246,       4.279
> >   32,   7.371,       6.366,       7.390,       4.550,       4.900,       4.537
> >   40,   8.621,       7.496,       8.601,       4.862,       5.162,       4.836
> >   48,   9.406,       8.128,       9.374,       5.206,       5.736,       5.234
> >   56,  10.535,       9.189,      10.522,       5.416,       5.772,       5.447
> >   64,  10.000,       7.487,       7.590,       6.946,       6.975,       6.989
> >   72,  11.192,       8.639,       8.763,       7.210,       7.311,       7.277
> >   80,  11.734,       9.179,       9.409,       7.605,       7.620,       7.548
> >   88,  12.933,      10.545,      10.584,       7.878,       7.902,       7.858
> >   96,  12.952,       9.331,      10.625,       8.168,       8.470,       8.206
> >  104,  14.206,      10.424,      11.839,       8.491,       8.785,       8.502
> >  112,  14.763,      11.403,      12.416,       8.798,       9.134,       8.771
> >  120,  15.955,      12.635,      13.651,       9.175,       9.494,       9.130
> >  128,  15.271,      10.599,      10.724,       9.726,       9.672,       9.655
> >
> > 'ver2' uses two accumulators for the 32-byte case and has better latency
> > numbers but regressions in throughput compared to 'old' and 'ver1'. 'ver1' is
> > the implementation posted, which has essentially the same throughput/latency
> > numbers as 'old' for sizes [0, 63].
>
> Which CPU are you testing on? It will make a big difference.

Tiger Lake, although assuming `adc` is the bottleneck, the results
should be largely independent of the microarchitecture.

> And what are you measuring throughput in?

Running back-to-back iterations with the same input and no dependency
between iterations, so the out-of-order window covers multiple
iterations at once.

> And are you testing aligned or misaligned 64-bit reads?

Aligned, as that is the common case.

>
> I think one of the performance counters will give 'cpu clocks'.

Time is in reference cycles, measured with `rdtsc`.
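
Concretely, the latency and throughput loops look roughly like the
following. This is only a sketch of the harness: the names
(csum_partial_test, ITERS) are made up for illustration, and the real
benchmark also handles warm-up, CPU pinning and loop overhead.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

/* Hypothetical stand-in for the routine under test. */
extern uint32_t csum_partial_test(const void *buf, int len, uint32_t seed);

#define ITERS 1000000L

/* Throughput: iterations are independent, so the out-of-order window
 * overlaps several of them at once. */
static double tput_cycles(const void *buf, int len)
{
	uint64_t t0 = __rdtsc();

	for (long i = 0; i < ITERS; i++)
		(void)csum_partial_test(buf, len, 0);

	return (double)(__rdtsc() - t0) / ITERS;
}

/* Latency: each result seeds the next call, so iterations serialize. */
static double lat_cycles(const void *buf, int len)
{
	uint32_t sum = 0;
	uint64_t t0 = __rdtsc();

	for (long i = 0; i < ITERS; i++)
		sum = csum_partial_test(buf, len, sum);

	uint64_t t1 = __rdtsc();

	if (sum == 0xdeadbeef)	/* consume sum so the chain is kept */
		puts("");

	return (double)(t1 - t0) / ITERS;
}
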
>
> I did some tests early last year and got 8 bytes/clock on Broadwell/Haswell
> with code that 'loop carried' the carry flag (a single adc chain).
> On older Intel CPUs (Ivy Bridge onwards) 'adc' has a latency of 2
> for the result, but the carry flag is available earlier.
> So alternating the target register in the 'adc' chain will give (nearly)
> 8 bytes/clock - I think I got to 7.5.
>
> That can all be done with only 4 reads per iteration.
> IIRC Broadwell/Haswell only need 2 reads/iteration.
>
> It is actually likely (certainly worth checking) that Haswell/Broadwell
> can do two misaligned memory reads every clock.
> So it may not be worth aligning the reads (which the old code did).
> In any case aren't tx packets likely to be aligned, and rx ones
> misaligned to some known 4n+2 boundary?
>
> Using adcx/adox together is a right PITA.

I'm a bit hesitant about adcx/adox because they are part of the ADX
extension, so support would need to be tested at runtime.
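
To make the two ideas concrete, below is a rough sketch of (a) a single
adc chain that alternates its destination between two accumulators, and
(b) the adcx/adox variant that carries CF and OF in two independent
chains. This is illustrative only, not the posted patch: the helper
names are invented, nothing here is tuned or tested, and the ADX
version would also need a feature check (something like
static_cpu_has(X86_FEATURE_ADX)) or an alternative.

#include <stdint.h>

/*
 * Sketch (a): one carry chain, alternating the destination register so
 * the 2-cycle adc result latency overlaps.  Sums 32 bytes into two
 * accumulators; the trailing carry is folded with adc $0.
 */
static uint64_t sum32_adc_alternate(const uint64_t *p, uint64_t sum)
{
	uint64_t a = sum, b = 0;

	asm("addq	0(%[p]), %[a]\n\t"
	    "adcq	8(%[p]), %[b]\n\t"
	    "adcq	16(%[p]), %[a]\n\t"
	    "adcq	24(%[p]), %[b]\n\t"
	    "adcq	$0, %[a]"
	    : [a] "+r" (a), [b] "+r" (b)
	    : [p] "r" (p)
	    : "cc", "memory");

	a += b;
	return a + (a < b);	/* fold the carry from combining a and b */
}

/*
 * Sketch (b): two independent chains over a 64-byte block.  adcx only
 * touches CF and adox only OF, so the chains interleave without
 * serializing on the flags.  Needs ADX (Broadwell and later).
 */
static uint64_t sum64_adx(const uint64_t *p, uint64_t sum)
{
	uint64_t a = sum, b = 0;

	asm("xorl	%%r10d, %%r10d\n\t"	/* r10 = 0, clears CF and OF */
	    "adcx	0(%[p]), %[a]\n\t"	/* CF chain */
	    "adox	8(%[p]), %[b]\n\t"	/* OF chain */
	    "adcx	16(%[p]), %[a]\n\t"
	    "adox	24(%[p]), %[b]\n\t"
	    "adcx	32(%[p]), %[a]\n\t"
	    "adox	40(%[p]), %[b]\n\t"
	    "adcx	48(%[p]), %[a]\n\t"
	    "adox	56(%[p]), %[b]\n\t"
	    "adcx	%%r10, %[a]\n\t"	/* fold trailing CF */
	    "adox	%%r10, %[b]"		/* fold trailing OF */
	    : [a] "+r" (a), [b] "+r" (b)
	    : [p] "r" (p)
	    : "r10", "cc", "memory");

	a += b;
	return a + (a < b);
}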

> I did get (about) 12 bytes/clock for long buffers while loop carrying
> both the overflow and carry flags.
>
> Also is there a copy of the patched code anywhere?

https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/


> I think I've missed some of the patches and they are difficult to follow.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
