Message-ID: <CANn89iLnH5B11CtzZ14nMFP7b--7aOfnQqgmsER+NYNzvnVurQ@mail.gmail.com>
Date:   Thu, 25 Nov 2021 17:50:31 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Noah Goldstein <goldstein.w.n@...il.com>
Cc:     tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
        peterz@...radead.org, alexanderduyck@...com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@...il.com> wrote:
>
> Modify the 8x loop so that it uses two independent
> accumulators. Despite adding more instructions the latency and
> throughput of the loop is improved because the `adc` chains can now
> take advantage of multiple execution units.

Nice!

Note that I get better results if I do a different split, because the
second chain gets shorter.

The first chain adds 5*8 bytes from the buffer, but the first 8 bytes
are a mere load, so that is really 4+1 additions.

The second chain adds 3*8 bytes from the buffer, plus the result coming
from the first chain: also 4+1 additions.

asm("movq 0*8(%[src]),%[res_tmp]\n\t"
    "addq 1*8(%[src]),%[res_tmp]\n\t"
    "adcq 2*8(%[src]),%[res_tmp]\n\t"
    "adcq 3*8(%[src]),%[res_tmp]\n\t"
    "adcq 4*8(%[src]),%[res_tmp]\n\t"
    "adcq $0,%[res_tmp]\n\t"
    "addq 5*8(%[src]),%[res]\n\t"
    "adcq 6*8(%[src]),%[res]\n\t"
    "adcq 7*8(%[src]),%[res]\n\t"
    "adcq %[res_tmp],%[res]\n\t"
    "adcq $0,%[res]"
    : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
    : [src] "r" (buff)
    : "memory");


>
> Make the memory clobbers more precise. 'buff' is read only and we know
> the exact usage range. There is no reason to write-clobber all memory.

Not sure if that matters in this function? Or do we expect it to be inlined?

Personally, I find the "memory" constraint to be more readable than these casts:
"m"(*(const char(*)[64])buff)

>
> Relative performance changes on Tigerlake:
>
> Time Unit: Ref Cycles
> Size Unit: Bytes
>
> size,   lat old,    lat new,    tput old,   tput new
>    0,     4.972,      5.054,       4.864,      4.870

Really what matters in modern networking is the case for 40 bytes, and
eventually 8 bytes.

Can you add these two cases to this nice table?

We hardly have to checksum anything with NICs that are not decades old.

Apparently, making the 64-byte loop slightly longer incentivizes gcc to
move it away (our intent with the unlikely() hint).

Anyway, I am thinking of providing a specialized inline version for
IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
time), so we will likely not use csum_partial() anymore.
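
A minimal sketch of that idea, assuming the common x == 0 case and
using the precise "m" constraint discussed above; csum_ipv6_header_40b
is a hypothetical name, not an existing kernel function:

static inline unsigned long csum_ipv6_header_40b(const void *buff)
{
	unsigned long sum;

	/* 40 bytes = 5 quadwords: one load, four additions, then fold
	 * the final carry back into the accumulator. */
	asm("movq 0*8(%[src]),%[sum]\n\t"
	    "addq 1*8(%[src]),%[sum]\n\t"
	    "adcq 2*8(%[src]),%[sum]\n\t"
	    "adcq 3*8(%[src]),%[sum]\n\t"
	    "adcq 4*8(%[src]),%[sum]\n\t"
	    "adcq $0,%[sum]"
	    : [sum] "=&r" (sum)
	    : [src] "r" (buff),
	      "m" (*(const char(*)[40])buff));
	return sum;
}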

Thanks!
