Date:   Tue, 6 Sep 2022 10:08:36 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Optimising csum_fold()

The default C version is:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (__force __sum16)~sum;
}

This has a register dependency chain length of at least 5.
It is longer still if register moves have a cost and the final
mask has to be done.
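
To make the chain of 5 explicit, here is the same fold broken into
single operations (a hypothetical decomposition with made-up names,
not any particular compiler's output).  The mask and shift of each
step can issue in parallel, but everything after that is dependent:

/* Illustrative decomposition only - plain user-space types. */
static inline unsigned short csum_fold_steps(unsigned int sum)
{
        unsigned int lo1 = sum & 0xffff;        /* depth 1 */
        unsigned int hi1 = sum >> 16;           /* depth 1, parallel */
        unsigned int s1  = lo1 + hi1;           /* depth 2 */
        unsigned int lo2 = s1 & 0xffff;         /* depth 3 */
        unsigned int hi2 = s1 >> 16;            /* depth 3, parallel */
        unsigned int s2  = lo2 + hi2;           /* depth 4 */
        return (unsigned short)~s2;             /* depth 5 */
}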

x86 (and other architectures with a carry flag) may use:
static inline __sum16 csum_fold(__wsum sum)
{
        asm("addl %1, %0                ;\n"
            "adcl $0xffff, %0   ;\n"
            : "=r" (sum)
            : "r" ((__force u32)sum << 16),
              "0" ((__force u32)sum & 0xffff0000));
        return (__force __sum16)(~(__force u32)sum >> 16);
}
This isn't actually any better!
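
For reference, here is a rough C rendering of what that asm computes
(my sketch with plain user-space types, not the kernel code).  The
add, the add-with-carry, the complement and the shift still chain up
much like the generic version:

/* Sketch of the x86 asm in C, for illustration only. */
static inline unsigned short csum_fold_x86_sketch(unsigned int sum)
{
        unsigned int high = sum & 0xffff0000;   /* "0" input operand */
        unsigned int low  = sum << 16;          /* "%1" input operand */
        unsigned int carry = high > 0xffffffffu - low;

        high += low;                    /* addl %1, %0 */
        high += 0xffff + carry;         /* adcl $0xffff, %0 - end-around */
        return (unsigned short)(~high >> 16);
}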

arm64 (and a few others) has this C version:
static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum += (sum >> 16) | (sum << 16);
        return (__force __sum16)~(sum >> 16);
}
Assuming the shifts get converted to a rotate,
this is one instruction shorter.
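For example, with csum = 0xffff0001: the rotate is 0x0001ffff, the
32-bit add gives 0x00010000 (the carry out of the low halves supplies
the end-around carry into the top half), so sum >> 16 is 0x0001 and
the result is ~0x0001 = 0xfffe - the same as the two-step fold above.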

Finally, arc has the slight variant:
static inline __sum16 csum_fold(__wsum s)
{
        unsigned r = s << 16 | s >> 16; /* ror */
        s = ~s;
        s -= r;
        return s >> 16;
}
On a multi-issue cpu the rotate and ~ can happen in the same clock.
If the compiler is any good the final mask is never needed.
So this has a register dependency chain length of 3.
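
For anyone who would rather test than eyeball the equivalence, a quick
throwaway user-space check of the three C variants (plain unsigned
types instead of __wsum/__sum16, corner values picked by me):

#include <stdio.h>
#include <stdlib.h>

static unsigned short fold_generic(unsigned int sum)
{
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
}

static unsigned short fold_rotate(unsigned int sum)
{
        sum += (sum >> 16) | (sum << 16);
        return (unsigned short)~(sum >> 16);
}

static unsigned short fold_arc(unsigned int s)
{
        unsigned int r = s << 16 | s >> 16;     /* ror */
        s = ~s;
        s -= r;
        return (unsigned short)(s >> 16);
}

int main(void)
{
        unsigned int corner[] = { 0, 1, 0xffff, 0x10000, 0x0001ffff,
                                  0xfffeffff, 0xffff0000, 0xffffffff };
        unsigned int i, v;

        for (i = 0; i < sizeof(corner) / sizeof(corner[0]); i++) {
                v = corner[i];
                if (fold_generic(v) != fold_rotate(v) ||
                    fold_generic(v) != fold_arc(v)) {
                        printf("mismatch at 0x%08x\n", v);
                        return 1;
                }
        }
        srand(1);
        for (i = 0; i < 10000000; i++) {
                v = (unsigned int)rand() * 2654435761u + (unsigned int)rand();
                if (fold_generic(v) != fold_rotate(v) ||
                    fold_generic(v) != fold_arc(v)) {
                        printf("mismatch at 0x%08x\n", v);
                        return 1;
                }
        }
        printf("all three folds agree\n");
        return 0;
}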

This looks to be better than the existing versions for
almost all architectures.
(There seem to be a few where the shifts aren't converted
to a rotate. I'd be surprised if the cpus don't have a
rotate instruction - so gcc must get confused.)

See https://godbolt.org/z/on1v6naoE

Annoyingly, it isn't trivial to convert most of the architectures to
the generic version because they don't include asm-generic/checksum.h.

It has to be said that this function generates 0x0000 rather than
0xffff when the folded sum is 0xffff (the two are equivalent
representations of zero in one's complement arithmetic).
That definitely matters for UDP over IPv6, where a checksum field
of 0 means 'no checksum' and is not allowed, so a zero result has
to be replaced by 0xffff.
One solution is to add one to the initial constant checksum
(usually 0) and then add one to the result of csum_fold().
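
To make that corner case concrete, a minimal user-space illustration
of the failing value and of the add-one-before/add-one-after
adjustment (plain types again, fold_generic() as above):

#include <stdio.h>

static unsigned short fold_generic(unsigned int sum)
{
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
}

int main(void)
{
        unsigned int csum = 0xffff;     /* folds to 0xffff, ~ gives 0 */
        unsigned short adj;

        printf("plain fold:    0x%04x\n", fold_generic(csum));

        /* Seed the accumulation with 1 instead of 0, then add 1 to the
         * folded result: 0x10000 folds to 1, ~1 is 0xfffe, +1 is 0xffff.
         */
        adj = (unsigned short)(fold_generic(csum + 1) + 1);
        printf("adjusted fold: 0x%04x\n", adj);
        return 0;
}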

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
