Date:   Thu, 11 Nov 2021 10:10:19 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        netdev <netdev@...r.kernel.org>,
        Eric Dumazet <edumazet@...gle.com>, x86@...nel.org,
        Alexander Duyck <alexander.duyck@...il.com>
Subject: Re: [RFC] x86/csum: rewrite csum_partial()

On Wed, Nov 10, 2021 at 10:53:22PM -0800, Eric Dumazet wrote:
> +		/*
> +		 * This implements an optimized version of
> +		 * switch (dwords) {
> +		 * case 15: res = add_with_carry(res, buf32[14]); fallthrough;
> +		 * case 14: res = add_with_carry(res, buf32[13]); fallthrough;
> +		 * case 13: res = add_with_carry(res, buf32[12]); fallthrough;
> +		 * ...
> +		 * case 3: res = add_with_carry(res, buf32[2]); fallthrough;
> +		 * case 2: res = add_with_carry(res, buf32[1]); fallthrough;
> +		 * case 1: res = add_with_carry(res, buf32[0]); fallthrough;
> +		 * }
> +		 *
> +		 * "adcl 8byteoff(%reg1),%reg2" are using either 3 or 4 bytes.
> +		 */
> +		asm("	call 1f\n"
> +		    "1:	pop %[dest]\n"

That's terrible. I think on x86_64 we can do: lea (%%rip), %[dest]; not
sure what the best way would be on i386.
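
Something like this, as a minimal user-space sketch (x86-64 and GCC/Clang
extended asm assumed; the helper name and label are illustrative, not from
the patch):

	#include <stdio.h>

	/* Return the address of a local asm label via a RIP-relative lea,
	 * with no call/pop pair involved. */
	static void *label_addr_via_rip(void)
	{
		void *dest;

		asm("lea 1f(%%rip), %[dest]\n\t"	/* dest = address of label 1 */
		    "1:"
		    : [dest] "=r" (dest));
		return dest;
	}

	int main(void)
	{
		printf("label 1 is at %p\n", label_addr_via_rip());
		return 0;
	}

In the quoted code that would become something like lea 2f(%%rip),%[dest]
followed by the existing indexed lea, since RIP-relative addressing can't
take an index register.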

> +		    "	lea (2f-1b)(%[dest],%[skip],4),%[dest]\n"
> +		    "	clc\n"
> +		    "	jmp *%[dest]\n               .align 4\n"

That's an indirect branch; you can't do that these days. This would need
to use JMP_NOSPEC (except we don't have a !ASSEMBLER version of that).
But that would also completely and utterly destroy performance.

Also, objtool would complain about this if it hadn't tripped over that
first instruction:

 arch/x86/lib/csum-partial_64.o: warning: objtool: do_csum()+0x84: indirect jump found in RETPOLINE build
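
For context, in a RETPOLINE build an indirect jmp has to be routed through
a thunk that looks roughly like the sketch below (hand-written here, the
symbol name is made up). The target in %rax is only ever reached through a
call/ret pair while speculation is parked in the pause/lfence loop, which
is why it would wreck a hot path like this:

	/* sketch of a retpoline-style thunk for a target held in %rax */
	asm(".pushsection .text\n"
	    "sketch_indirect_thunk_rax:\n"
	    "	call 2f\n"		/* pushes the address of the trap loop */
	    "1:	pause\n"		/* speculative execution spins here */
	    "	lfence\n"
	    "	jmp 1b\n"
	    "2:	mov %rax, (%rsp)\n"	/* overwrite return address with the real target */
	    "	ret\n"			/* architecturally jump to *%rax */
	    ".popsection");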

I'm not sure what the best way is to unroll loops without using computed
gotos/jump-tables though :/
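
One direction, sketched very roughly (the helper name and kernel-style u32
types are mine, not the patch's; x86-64 extended asm assumed): peel the
tail with direct branches on the bits of the dword count, so every adcl
run is reached by an ordinary, predictable jump:

	/* Fold a tail of fewer than 16 dwords into res without a jump table. */
	static u32 csum_tail_sketch(const u32 *buf32, unsigned int dwords, u32 res)
	{
		if (dwords & 8) {
			asm("addl 0*4(%[src]),%[res]\n\t"
			    "adcl 1*4(%[src]),%[res]\n\t"
			    "adcl 2*4(%[src]),%[res]\n\t"
			    "adcl 3*4(%[src]),%[res]\n\t"
			    "adcl 4*4(%[src]),%[res]\n\t"
			    "adcl 5*4(%[src]),%[res]\n\t"
			    "adcl 6*4(%[src]),%[res]\n\t"
			    "adcl 7*4(%[src]),%[res]\n\t"
			    "adcl $0,%[res]"		/* fold the carry back in */
			    : [res] "+r" (res)
			    : [src] "r" (buf32)
			    : "memory");
			buf32 += 8;
		}
		if (dwords & 4) {
			asm("addl 0*4(%[src]),%[res]\n\t"
			    "adcl 1*4(%[src]),%[res]\n\t"
			    "adcl 2*4(%[src]),%[res]\n\t"
			    "adcl 3*4(%[src]),%[res]\n\t"
			    "adcl $0,%[res]"
			    : [res] "+r" (res)
			    : [src] "r" (buf32)
			    : "memory");
			buf32 += 4;
		}
		if (dwords & 2) {
			asm("addl 0*4(%[src]),%[res]\n\t"
			    "adcl 1*4(%[src]),%[res]\n\t"
			    "adcl $0,%[res]"
			    : [res] "+r" (res)
			    : [src] "r" (buf32)
			    : "memory");
			buf32 += 2;
		}
		if (dwords & 1) {
			asm("addl 0*4(%[src]),%[res]\n\t"
			    "adcl $0,%[res]"
			    : [res] "+r" (res)
			    : [src] "r" (buf32)
			    : "memory");
		}
		return res;
	}

That trades the single computed jump for up to four well-predicted
branches plus a few extra "adcl $0" folds.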

> +		    "2:\n"
> +		    "	adcl 14*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 13*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 12*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 11*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 10*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 9*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 8*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 7*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 6*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 5*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 4*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 3*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 2*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 1*4(%[src]),%[res]\n   .align 4\n"
> +		    "	adcl 0*4(%[src]),%[res]\n"
> +		    "	adcl $0,%[res]"

If only the CPU would accept: REP ADCL (%%rsi), %[res]   :/

> +			: [res] "=r" (result), [dest] "=&r" (dest)
> +			: [src] "r" (buff), "[res]" (result),
> +			  [skip] "r" (dwords ^ 15)
> +			: "memory");
> +	}
