Date:   Tue, 6 Sep 2022 10:08:36 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Optimising csum_fold()

The default C version is:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (__force __sum16)~sum;
}

This has a register dependency chain length of at least 5.
It is longer still if register moves have a cost and the final
mask has to be done.
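
To make the chain of 5 explicit, here is the same fold broken into
single operations (a hypothetical decomposition with made-up names,
not any particular compiler's output).  The mask and shift of each
step can issue in parallel, but everything after that is dependent:

/* Illustrative decomposition only - plain user-space types. */
static inline unsigned short csum_fold_steps(unsigned int sum)
{
        unsigned int lo1 = sum & 0xffff;        /* depth 1 */
        unsigned int hi1 = sum >> 16;           /* depth 1, parallel */
        unsigned int s1  = lo1 + hi1;           /* depth 2 */
        unsigned int lo2 = s1 & 0xffff;         /* depth 3 */
        unsigned int hi2 = s1 >> 16;            /* depth 3, parallel */
        unsigned int s2  = lo2 + hi2;           /* depth 4 */
        return (unsigned short)~s2;             /* depth 5 */
}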

x86 (and other architectures with a carry flag) may use:
static inline __sum16 csum_fold(__wsum sum)
{
        asm("addl %1, %0                ;\n"
            "adcl $0xffff, %0   ;\n"
            : "=r" (sum)
            : "r" ((__force u32)sum << 16),
              "0" ((__force u32)sum & 0xffff0000));
        return (__force __sum16)(~(__force u32)sum >> 16);
}
This isn't actually any better!
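
For reference, here is a rough C rendering of what that asm computes
(my sketch with plain user-space types, not the kernel code).  The
add, the add-with-carry, the complement and the shift still chain up
much like the generic version:

/* Sketch of the x86 asm in C, for illustration only. */
static inline unsigned short csum_fold_x86_sketch(unsigned int sum)
{
        unsigned int high = sum & 0xffff0000;   /* "0" input operand */
        unsigned int low  = sum << 16;          /* "%1" input operand */
        unsigned int carry = high > 0xffffffffu - low;

        high += low;                    /* addl %1, %0 */
        high += 0xffff + carry;         /* adcl $0xffff, %0 - end-around */
        return (unsigned short)(~high >> 16);
}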

arm64 (and a few others) has this C version:
static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum += (sum >> 16) | (sum << 16);
        return (__force __sum16)~(sum >> 16);
}
Assuming the shifts get converted to a rotate,
this is one instruction shorter.
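For example, with csum = 0xffff0001: the rotate is 0x0001ffff, the
32-bit add gives 0x00010000 (the carry out of the low halves supplies
the end-around carry into the top half), so sum >> 16 is 0x0001 and
the result is ~0x0001 = 0xfffe - the same as the two-step fold above.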

Finally, arc has the slight variant:
static inline __sum16 csum_fold(__wsum s)
{
        unsigned r = s << 16 | s >> 16; /* ror */
        s = ~s;
        s -= r;
        return s >> 16;
}
On a multi-issue cpu the rotate and ~ can happen in the same clock.
If the compiler is any good the final mask is never needed.
So this has a register dependency chain length of 3.
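
For anyone who would rather test than eyeball the equivalence, a quick
throwaway user-space check of the three C variants (plain unsigned
types instead of __wsum/__sum16, corner values picked by me):

#include <stdio.h>
#include <stdlib.h>

static unsigned short fold_generic(unsigned int sum)
{
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
}

static unsigned short fold_rotate(unsigned int sum)
{
        sum += (sum >> 16) | (sum << 16);
        return (unsigned short)~(sum >> 16);
}

static unsigned short fold_arc(unsigned int s)
{
        unsigned int r = s << 16 | s >> 16;     /* ror */
        s = ~s;
        s -= r;
        return (unsigned short)(s >> 16);
}

int main(void)
{
        unsigned int corner[] = { 0, 1, 0xffff, 0x10000, 0x0001ffff,
                                  0xfffeffff, 0xffff0000, 0xffffffff };
        unsigned int i, v;

        for (i = 0; i < sizeof(corner) / sizeof(corner[0]); i++) {
                v = corner[i];
                if (fold_generic(v) != fold_rotate(v) ||
                    fold_generic(v) != fold_arc(v)) {
                        printf("mismatch at 0x%08x\n", v);
                        return 1;
                }
        }
        srand(1);
        for (i = 0; i < 10000000; i++) {
                v = (unsigned int)rand() * 2654435761u + (unsigned int)rand();
                if (fold_generic(v) != fold_rotate(v) ||
                    fold_generic(v) != fold_arc(v)) {
                        printf("mismatch at 0x%08x\n", v);
                        return 1;
                }
        }
        printf("all three folds agree\n");
        return 0;
}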

This looks to be better than the existing versions for
almost all architectures.
(There seem to be a few where the shifts aren't converted
to a rotate. I'd be surprised if the cpus don't have a
rotate instruction - so gcc must get confused.)

See https://godbolt.org/z/on1v6naoE

Annoyingly, it isn't trivial to convert most of the architectures to
the generic version because they don't include asm-generic/checksum.h.

It has to be said that this function generates 0x0000 rather than
0xffff when the folded sum is 0xffff (the two are equivalent
representations of zero in one's complement arithmetic).
That definitely matters for UDP over IPv6, where a checksum field
of 0 means 'no checksum' and is not allowed, so a zero result has
to be replaced by 0xffff.
One solution is to add one to the initial constant checksum
(usually 0) and then add one to the result of csum_fold().
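
To make that corner case concrete, a minimal user-space illustration
of the failing value and of the add-one-before/add-one-after
adjustment (plain types again, fold_generic() as above):

#include <stdio.h>

static unsigned short fold_generic(unsigned int sum)
{
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
}

int main(void)
{
        unsigned int csum = 0xffff;     /* folds to 0xffff, ~ gives 0 */
        unsigned short adj;

        printf("plain fold:    0x%04x\n", fold_generic(csum));

        /* Seed the accumulation with 1 instead of 0, then add 1 to the
         * folded result: 0x10000 folds to 1, ~1 is 0xfffe, +1 is 0xffff.
         */
        adj = (unsigned short)(fold_generic(csum + 1) + 1);
        printf("adjusted fold: 0x%04x\n", adj);
        return 0;
}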

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
