Date:   Mon, 25 Nov 2019 11:53:37 +0100
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     netdev@...r.kernel.org, ast@...nel.org,
        Björn Töpel <bjorn.topel@...el.com>,
        bpf@...r.kernel.org, magnus.karlsson@...il.com,
        magnus.karlsson@...el.com, jonathan.lemon@...il.com,
        ecree@...arflare.com, thoiland@...hat.com,
        andrii.nakryiko@...il.com, tariqt@...lanox.com,
        saeedm@...lanox.com, maximmi@...lanox.com
Subject: Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@...el.com>
> 
> The BPF dispatcher is a multiway branch code generator, mainly
> targeted for XDP programs. When an XDP program is executed via
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transforms multiple indirect
> calls to direct calls, and therefore avoids the retpoline. The
> dispatcher is generated using the BPF JIT, and relies on text poking
> provided by bpf_arch_text_poke().
> 
> The dispatcher hijacks a trampoline function via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to 16
> dispatch points. This can be extended in the future.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each unique XDP program has a slot in the
> dispatcher, registered by a netdev. The netdev then uses the
> dispatcher to call the correct program with a direct call.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@...el.com>
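
Just to make sure I read the mechanism right: what the generated
dispatcher boils down to is roughly the C model below, correct? (prog0,
prog1 and dispatch() are made-up names for illustration only; the real
thing is of course emitted as machine code by emit_bpf_dispatcher()
further down and entered through the patched __fentry__ nop.)

  /* Illustrative model only, not from the patch. */
  typedef unsigned int (*bpf_func_t)(const void *ctx);

  static unsigned int prog0(const void *ctx) { return 0; }   /* stand-in */
  static unsigned int prog1(const void *ctx) { return 1; }   /* stand-in */

  static unsigned int dispatch(const void *ctx, bpf_func_t bpf_func)
  {
          if (bpf_func == prog0)
                  return prog0(ctx);      /* direct call, no retpoline */
          if (bpf_func == prog1)
                  return prog1(ctx);      /* direct call, no retpoline */

          /* unknown program: fall back to the indirect (retpolined) call */
          return bpf_func(ctx);
  }

  int main(void)
  {
          return dispatch((void *)0, prog1) == 1 ? 0 : 1; /* smoke test */
  }
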
[...]
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> +{
> +	u8 *jg_reloc, *jg_target, *prog = *pprog;
> +	int pivot, err, jg_bytes = 1, cnt = 0;
> +	s64 jg_offset;
> +
> +	if (a == b) {
> +		/* Leaf node of recursion, i.e. not a range of indices
> +		 * anymore.
> +		 */
> +		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> +		if (!is_simm32(progs[a]))
> +			return -1;
> +		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> +			    progs[a]);
> +		err = emit_cond_near_jump(&prog,	/* je func */
> +					  (void *)progs[a], prog,
> +					  X86_JE);
> +		if (err)
> +			return err;
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);
> +		if (err)
> +			return err;
> +
> +		*pprog = prog;
> +		return 0;
> +	}
> +
> +	/* Not a leaf node, so we pivot, and recursively descend into
> +	 * the lower and upper ranges.
> +	 */
> +	pivot = (b - a) / 2;
> +	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
> +	if (!is_simm32(progs[a + pivot]))
> +		return -1;
> +	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> +
> +	if (pivot > 2) {				/* jg upper_part */
> +		/* Require near jump. */
> +		jg_bytes = 4;
> +		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> +	} else {
> +		EMIT2(X86_JG, 0);
> +	}
> +	jg_reloc = prog;
> +
> +	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
> +				  progs);
> +	if (err)
> +		return err;
> +
> +	/* Intel 64 and IA-32 Architectures Optimization Reference
> +	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> +	 * Rule 12. (M impact, H generality) All branch targets should
> +	 * be 16-byte aligned.

Isn't this section 3.4.1.4, rule 11, or are you reading a newer manual
than the one on the website [0]? :) Just wondering, in your IXIA tests,
did you see any noticeable slowdown if you don't do the 16-byte
alignment, as is the case in the rest of the kernel [1,2]?

  [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
  [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
  [2] https://lore.kernel.org/patchwork/patch/560050/

> +	 */
> +	jg_target = PTR_ALIGN(prog, 16);
> +	if (jg_target != prog)
> +		emit_nops(&prog, jg_target - prog);
> +	jg_offset = prog - jg_reloc;
> +	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> +
> +	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
> +				  b, progs);
> +	if (err)
> +		return err;
> +
> +	*pprog = prog;
> +	return 0;
> +}
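
Btw, purely to convince myself about the recursion: a small stand-alone
userspace sketch (dispatch_tree() is my naming, nothing from the patch)
that just prints the compare/branch tree the above would emit for the
current maximum of 16 slots:

  #include <stdio.h>

  /* Leaf nodes emit a cmp/je/jmp-thunk triple, inner nodes emit a
   * cmp/jg around the lower and upper halves, mirroring
   * emit_bpf_dispatcher() above.
   */
  static void dispatch_tree(int a, int b, int depth)
  {
          int pivot;

          if (a == b) {
                  printf("%*scmp progs[%d]; je progs[%d]; jmp thunk\n",
                         2 * depth, "", a, a);
                  return;
          }

          pivot = (b - a) / 2;
          printf("%*scmp progs[%d]; jg upper\n", 2 * depth, "", a + pivot);
          dispatch_tree(a, a + pivot, depth + 1);         /* lower_part */
          dispatch_tree(a + pivot + 1, b, depth + 1);     /* upper_part */
  }

  int main(void)
  {
          dispatch_tree(0, 15, 0);        /* 16 dispatch slots */
          return 0;
  }
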
